Recent talking head synthesis works typically adopt speech features extracted from large-scale pre-trained acoustic models. However, the intrinsic many-to-many relationship between speech and lip motion causes phoneme-viseme alignment ambiguity, leading to inaccurate and unstable lip motion. To further improve lip sync accuracy, we propose PASE (Phoneme-Aware Speech Encoder), a novel speech representation model that bridges the gap between phonemes and visemes. PASE explicitly introduces phoneme embeddings as alignment anchors and employs a contrastive alignment module to enhance the discriminability of corresponding audio-visual pairs. In addition, a prediction and reconstruction task is designed to improve robustness under noise and partial modality absence. Experimental results show that PASE significantly improves lip sync accuracy and achieves state-of-the-art performance in both NeRF- and 3DGS-based rendering frameworks, outperforming conventional methods based on acoustic features by 13.7% and 14.2%, respectively. Importantly, PASE can be seamlessly integrated into diverse talking head pipelines to improve lip sync accuracy without architectural modifications.
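To make the contrastive alignment idea concrete, the following is a minimal sketch (not the authors' released code) of how phoneme embeddings could serve as alignment anchors: an InfoNCE-style objective pulls each audio frame toward the embedding of its aligned phoneme and pushes it away from all other phonemes. All names and hyperparameters (PhonemeAwareAligner, phoneme_ids, temperature) are illustrative assumptions, since the abstract does not specify implementation details.

```python
# Hedged sketch of a contrastive phoneme-audio alignment module.
# Assumed names: PhonemeAwareAligner, audio_proj, phoneme_emb, phoneme_ids.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PhonemeAwareAligner(nn.Module):
    def __init__(self, audio_dim: int, embed_dim: int, num_phonemes: int,
                 temperature: float = 0.07):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)          # project acoustic features
        self.phoneme_emb = nn.Embedding(num_phonemes, embed_dim)   # phoneme anchor embeddings
        self.temperature = temperature

    def forward(self, audio_feats: torch.Tensor, phoneme_ids: torch.Tensor) -> torch.Tensor:
        """
        audio_feats: (B, T, audio_dim) frame-level acoustic features
        phoneme_ids: (B, T) index of the phoneme aligned to each frame
        returns: scalar contrastive alignment loss
        """
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)       # (B, T, D)
        p = F.normalize(self.phoneme_emb.weight, dim=-1)            # (P, D)

        # Cosine similarity of every audio frame against every phoneme anchor.
        logits = torch.einsum("btd,pd->btp", a, p) / self.temperature  # (B, T, P)

        # InfoNCE: the aligned phoneme is the positive, all other phonemes are negatives.
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               phoneme_ids.reshape(-1))
```

Under this reading, the resulting phoneme-anchored audio features would replace generic acoustic features as the speech representation fed to the downstream NeRF- or 3DGS-based renderer.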