This paper presents a novel approach for text- and speech-driven animation of a photo-realistic head model based on blend-shape geometry, dynamic textures, and neural rendering. Training a variational autoencoder (VAE) for geometry and texture yields a parametric model that accurately captures and realistically synthesizes facial expressions from a latent feature vector. Our animation method is based on a conditional CNN that transforms text or speech into a sequence of animation parameters. In contrast to previous approaches, our animation model learns to disentangle and synthesize different acting styles in an unsupervised manner, requiring only phonetic labels that describe the content of the training sequences. For realistic real-time rendering, we train a U-Net that refines rasterization-based renderings by computing improved pixel colors and a foreground matte. We compare our framework qualitatively and quantitatively against recent methods for head modeling and facial animation, and evaluate the perceived rendering and animation quality in a user study, which indicates large improvements over state-of-the-art approaches.