The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: self-supervised learning (SSL) features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We find that encoding phonetic classes benefits facial animation prediction for both semantic and label-based representations, which reach comparable facial animation quality. Building on the label-based representations, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space from which both speech and 3D facial motion are decoded.