This paper presents a novel approach for text- and speech-driven animation of a photo-realistic head model based on blend-shape geometry, dynamic textures, and neural rendering. Training a variational autoencoder (VAE) for geometry and texture yields a parametric model that accurately captures and realistically synthesizes facial expressions from a latent feature vector. Our animation method is based on a conditional CNN that transforms text or speech into a sequence of animation parameters. In contrast to previous approaches, our animation model learns to disentangle and synthesize different acting styles in an unsupervised manner, requiring only phonetic labels that describe the content of the training sequences. For realistic real-time rendering, we train a U-Net that refines rasterization-based renderings by computing improved pixel colors and a foreground matte. We compare our framework qualitatively and quantitatively against recent methods for head modeling and facial animation, and we evaluate the perceived rendering and animation quality in a user study, which indicates large improvements over state-of-the-art approaches.
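The abstract states that the rendering network outputs improved pixel colors together with a foreground matte, but it does not say how the two are combined. The sketch below illustrates one common use of such a matte, standard alpha compositing over a background image. The function name `composite`, the nested-list image layout, and the assumption that the matte holds values in [0, 1] are all hypothetical choices for illustration and are not taken from the paper.

```python
def composite(foreground, matte, background):
    """Blend two images per pixel: out = a * fg + (1 - a) * bg.

    foreground, background: rows of (r, g, b) tuples with float channels.
    matte: rows of floats in [0, 1]; 1 means fully foreground (assumed convention).
    """
    out = []
    for fg_row, a_row, bg_row in zip(foreground, matte, background):
        out_row = []
        for fg, a, bg in zip(fg_row, a_row, bg_row):
            # Blend each color channel with the per-pixel matte value.
            out_row.append(tuple(a * f + (1.0 - a) * b for f, b in zip(fg, bg)))
        out.append(out_row)
    return out


# Hypothetical 1x3 image: red refined rendering over a blue background.
head = [[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
alpha = [[1.0, 0.5, 0.0]]
scene = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]
result = composite(head, alpha, scene)
```

A soft matte (values strictly between 0 and 1) lets boundary pixels such as hair blend with the background rather than being cut out with a hard mask.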