Most existing audio-driven 3D facial animation methods lack detailed facial expressions and head pose, which results in an unsatisfactory human-robot interaction experience. In this paper, we propose a novel pose-controllable 3D facial animation synthesis method based on hierarchical audio-vertex attention. To synthesize realistic and detailed expressions, a hierarchical decomposition strategy encodes the audio signal into both a global latent feature and a local vertex-wise control feature. The local and global audio features are then combined with vertex spatial features, and a graph convolutional neural network predicts the final consistent facial animation by fusing the intrinsic spatial topology of the face model with the corresponding semantic features of the audio. To achieve pose-controllable animation, we introduce a novel pose attribute augmentation method that utilizes 2D talking-face techniques. Qualitative and quantitative experiments show that the proposed method produces more realistic facial expressions and head pose movements and achieves competitive performance against state-of-the-art methods.
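The hierarchical decomposition and graph-convolution fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the `params` dictionary, the use of mean pooling for the global feature, scaled dot-product attention for the vertex-wise feature, and a single linear graph-convolution layer are all assumptions chosen for illustration. The pose attribute augmentation step is not modelled.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def normalized_adjacency(edges, num_vertices):
    """Symmetrically normalized mesh adjacency with self-loops."""
    adj = np.eye(num_vertices)
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def animate_frame(audio, vertices, edges, params):
    """Predict displaced vertices for one frame (hypothetical sketch).

    audio:    (T, d_a) window of audio features
    vertices: (V, 3) template vertex positions
    edges:    list of (i, j) mesh edges
    params:   dict of weight matrices (hypothetical names)
    """
    num_vertices = vertices.shape[0]

    # Global latent feature: assumed here to be a projection of the
    # mean-pooled audio window.
    global_feat = audio.mean(axis=0) @ params["W_global"]        # (d_g,)

    # Local vertex-wise control feature: each vertex attends over the
    # audio frames (assumed scaled dot-product attention).
    queries = vertices @ params["W_q"]                           # (V, d_k)
    keys = audio @ params["W_k"]                                 # (T, d_k)
    values = audio @ params["W_v"]                               # (T, d_l)
    attn = softmax(queries @ keys.T / np.sqrt(keys.shape[1]), axis=1)
    local_feat = attn @ values                                   # (V, d_l)

    # Fuse vertex spatial features with local and global audio features.
    fused = np.concatenate(
        [vertices, local_feat, np.tile(global_feat, (num_vertices, 1))],
        axis=1,
    )

    # One linear graph-convolution layer over the mesh topology,
    # producing a per-vertex offset.
    a_hat = normalized_adjacency(edges, num_vertices)
    offsets = a_hat @ fused @ params["W_gcn"]                    # (V, 3)
    return vertices + offsets, attn
```

In this sketch the attention weights give every vertex its own view of the audio window, while the global feature is shared by all vertices. The graph convolution then spreads information along mesh edges so that neighbouring vertices move consistently.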