Creating controllable 3D human portraits from casual smartphone videos is highly desirable because of their value in AR/VR applications. Recent work on 3D Gaussian Splatting (3DGS) has improved both rendering quality and training efficiency. However, accurately modeling and disentangling head movements and facial expressions from a single-view capture, as high-quality rendering requires, remains a challenge. In this paper, we introduce Rig3DGS to address this challenge. We represent the entire scene, including the dynamic subject, as a set of 3D Gaussians in a canonical space. Given a set of control signals, such as head pose and expressions, we transform these Gaussians to 3D space with learned deformations to generate the desired rendering. Our key innovation is a carefully designed deformation method guided by a learnable prior derived from a 3D morphable model. This approach is highly efficient to train and effective at controlling facial expressions and head positions and at view synthesis across various captures. We demonstrate the effectiveness of our learned deformation through extensive quantitative and qualitative experiments. The project page can be found at http://shahrukhathar.github.io/2024/02/05/Rig3DGS.html
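To make the idea of a deformation guided by a 3D morphable model prior more concrete, the following is a minimal sketch and not the Rig3DGS implementation. It assumes that each canonical Gaussian center is moved by a weighted blend of the displacements of its nearest morphable-model vertices. The function name `deform_gaussian_centers`, the parameters `k` and `temperature`, and the fixed distance-based weights are all hypothetical; in a learned system the blend weights would be predicted by a network rather than computed from distances.

```python
import numpy as np


def deform_gaussian_centers(centers, verts_canonical, verts_posed, k=4, temperature=0.05):
    """Move canonical Gaussian centers using displacements of nearby mesh vertices.

    centers:         (N, 3) Gaussian centers in canonical space.
    verts_canonical: (V, 3) morphable-model vertices in canonical space.
    verts_posed:     (V, 3) the same vertices after applying pose and expression.
    Illustrative assumption: the weights are a softmax over negative distances,
    standing in for weights that a real system would learn.
    """
    # Per-vertex displacement implied by the control signals.
    vertex_offsets = verts_posed - verts_canonical  # (V, 3)

    # Distances from every Gaussian center to every canonical vertex.
    dists = np.linalg.norm(
        centers[:, None, :] - verts_canonical[None, :, :], axis=-1
    )  # (N, V)

    # Indices and distances of the k nearest vertices for each Gaussian.
    nearest = np.argsort(dists, axis=1)[:, :k]  # (N, k)
    nearest_d = np.take_along_axis(dists, nearest, axis=1)  # (N, k)

    # Numerically stable softmax over negative distances; each row sums to 1.
    logits = -nearest_d / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Blend the neighbors' displacements and apply them to the centers.
    blended = (weights[..., None] * vertex_offsets[nearest]).sum(axis=1)  # (N, 3)
    return centers + blended


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    verts = rng.normal(size=(50, 3))
    centers = rng.normal(size=(10, 3))
    shift = np.array([0.1, -0.2, 0.3])
    moved = deform_gaussian_centers(centers, verts, verts + shift)
    print(moved.shape)
```

Because the weights for each Gaussian sum to one, a rigid translation of the whole mesh translates every Gaussian by the same amount, and an unchanged mesh leaves the Gaussians in place. Restricting the deformation to a blend of nearby vertex motions is one way a mesh prior can constrain what would otherwise be an under-determined problem in a single-view capture.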