We introduce an approach that creates animatable human avatars from monocular videos using 3D Gaussian Splatting (3DGS). Existing methods based on neural radiance fields (NeRFs) achieve high-quality novel-view and novel-pose image synthesis but often require days of training and are extremely slow at inference time. Recently, the community has explored fast grid structures for efficient training of clothed avatars. Although extremely fast to train, these methods barely reach interactive rendering rates, at around 15 FPS. In this paper, we use 3D Gaussian Splatting and learn a non-rigid deformation network to reconstruct animatable clothed human avatars that can be trained within 30 minutes and rendered at real-time frame rates (50+ FPS). Given the explicit nature of our representation, we further introduce as-isometric-as-possible regularizations on both the Gaussian mean vectors and the covariance matrices, enhancing the generalization of our model to highly articulated unseen poses. Experimental results show that our method achieves performance comparable to or even better than state-of-the-art approaches on animatable avatar creation from a monocular input, while being 400x and 250x faster in training and inference, respectively.
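To make the as-isometric-as-possible idea concrete, the following is a minimal sketch of a distance-preserving penalty on Gaussian mean vectors. The abstract does not give the formulation, so everything here is an assumption for illustration: the function names (`k_nearest`, `isometric_position_loss`), the choice of k nearest neighbours in canonical space, and the absolute-difference penalty are hypothetical, and the analogous term on covariance matrices is not shown.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def isometric_position_loss(canonical, deformed, k=2):
    """Assumed form of the penalty: for every point i and each of its k
    nearest canonical-space neighbours j, accumulate
    | dist(canonical_i, canonical_j) - dist(deformed_i, deformed_j) |.
    """
    loss = 0.0
    for i in range(len(canonical)):
        for j in k_nearest(canonical, i, k):
            d_canonical = math.dist(canonical[i], canonical[j])
            d_deformed = math.dist(deformed[i], deformed[j])
            loss += abs(d_canonical - d_deformed)
    return loss


# Three toy Gaussian centres in canonical space.
canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

# Rigid motion (rotate 90 degrees about z, then translate by (5, 5, 0)):
# pairwise distances are preserved, so the penalty should vanish.
rigid = [(5.0, 5.0, 0.0), (5.0, 6.0, 0.0), (4.0, 5.0, 0.0)]

# Uniform stretch by a factor of 2: distances double, so the penalty is positive.
stretched = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in canonical]

loss_rigid = isometric_position_loss(canonical, rigid)
loss_stretched = isometric_position_loss(canonical, stretched)
```

Under these assumptions, a rigid motion of the centres incurs no penalty while a stretch does, which is the behaviour one would want from a regularizer meant to discourage implausible non-rigid distortion in unseen poses.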