This paper tackles the challenge of dynamic view synthesis from multi-view videos. The key observation is that while previous grid-based methods offer consistent rendering, they fall short in capturing the appearance details of complex dynamic scenes, whereas multi-view image-based rendering methods show the opposite properties: they capture appearance details well but render less consistently. To combine the best of both worlds, we introduce Im4D, a hybrid scene representation that consists of a grid-based geometry representation and a multi-view image-based appearance representation. Specifically, the dynamic geometry is encoded as a 4D density function composed of spatiotemporal feature planes and a small MLP network, which globally models the scene structure and promotes rendering consistency. We represent the scene appearance by the original multi-view videos and a network that learns to predict the color of a 3D point from image features, rather than memorizing detailed appearance entirely within network weights, which makes the networks easier to train. Our method is evaluated on five dynamic view synthesis datasets: DyNeRF, ZJU-MoCap, NHR, DNA-Rendering, and ENeRF-Outdoor. The results show that Im4D achieves state-of-the-art rendering quality and can be trained efficiently, while rendering 512×512 images in real time at 79.8 FPS on a single RTX 3090 GPU.
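As a rough illustration of the geometry branch only, the sketch below shows one way a 4D density function could be assembled from spatiotemporal feature planes and a small MLP. The abstract does not specify the plane layout, resolution, feature width, or activation, so the six-plane factorization (xy, xz, yz, xt, yt, zt), the constants `RES` and `FEAT`, the randomly initialized weights, and the function names `bilinear` and `density` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the abstract does not give the real ones.
RES, FEAT = 16, 4
# Assumed six-plane factorization of (x, y, z, t): xy, xz, yz, xt, yt, zt.
PAIRS = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]

# One 2D feature grid per coordinate pair (random stand-ins for learned values).
planes = [rng.normal(0.0, 0.1, size=(RES, RES, FEAT)) for _ in PAIRS]

# Small MLP: concatenated plane features -> hidden layer -> scalar density.
W1 = rng.normal(0.0, 0.1, size=(FEAT * len(PAIRS), 32))
b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.1, size=(32, 1))
b2 = np.zeros(1)


def bilinear(plane, u, v):
    """Bilinearly interpolate a (RES, RES, FEAT) plane at coords u, v in [0, 1]."""
    res = plane.shape[0]
    x = np.clip(u, 0.0, 1.0) * (res - 1)
    y = np.clip(v, 0.0, 1.0) * (res - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, res - 1)
    y1 = np.minimum(y0 + 1, res - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    low = plane[x0, y0] * (1 - wx) + plane[x1, y0] * wx
    high = plane[x0, y1] * (1 - wx) + plane[x1, y1] * wx
    return low * (1 - wy) + high * wy


def density(xyzt):
    """Map (N, 4) points (x, y, z, t) in [0, 1] to (N,) positive densities."""
    feats = [bilinear(p, xyzt[:, a], xyzt[:, b]) for p, (a, b) in zip(planes, PAIRS)]
    hidden = np.maximum(np.concatenate(feats, axis=-1) @ W1 + b1, 0.0)  # ReLU
    raw = (hidden @ W2 + b2)[:, 0]
    return np.logaddexp(0.0, raw)  # softplus keeps the density positive
```

Calling `density` on an `(N, 4)` array of normalized space-time coordinates returns one density per point. In the full method these densities would feed volume rendering, with colors predicted from multi-view image features rather than from the planes; that appearance branch is not sketched here.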