This paper addresses the challenge of quickly reconstructing free-viewpoint videos of dynamic humans from sparse multi-view videos. Some recent works represent the dynamic human as a canonical neural radiance field (NeRF) and a motion field, both learned from videos through differentiable rendering, but their per-scene optimization generally requires hours. Other, generalizable NeRF models leverage priors learned from datasets and reduce the optimization time by only fine-tuning on new scenes, at the cost of visual fidelity. In this paper, we propose a novel method for learning neural volumetric videos of dynamic humans from sparse-view videos in minutes with competitive visual quality. Specifically, we define a novel part-based voxelized human representation to better distribute the representational power of the network across different human parts. Furthermore, we propose a novel 2D motion parameterization scheme to increase the convergence rate of deformation field learning. Experiments demonstrate that our model can be learned 100 times faster than prior per-scene optimization methods while remaining competitive in rendering quality. Training our model on a $512 \times 512$ video with 100 frames typically takes about 5 minutes on a single RTX 3090 GPU. The code will be released on our $\href{https://zju3dv.github.io/instant_nvr}{project~page}$.