As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in a single process. The new training regime, meanwhile, enables better synthesis quality in far fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.
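To make the training regime named in the abstract more concrete, the following is a minimal sketch of the standard OT-CFM objective and of few-step Euler sampling, written with NumPy. It is not the authors' implementation. The abstract does not specify feature sizes, the value of `sigma_min`, the ODE solver, or the network itself, so the stacked acoustic-plus-pose feature layout, the constant `SIGMA_MIN`, and the `vector_field` callable are all assumptions made purely for illustration.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed value; the abstract does not state it


def ot_cfm_pair(x1, x0, t, sigma_min=SIGMA_MIN):
    """Build the OT-CFM training pair for data x1, noise x0 and times t.

    x1, x0: arrays of shape (batch, frames, features). Here the feature
            axis is assumed to stack acoustic and pose features so that
            one flow covers both modalities.
    t:      array of shape (batch,) with values in [0, 1].
    Returns (x_t, u_t): the point on the straight-line path and the
    constant target velocity along that path.
    """
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))  # broadcast over frames/features
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t


def ot_cfm_loss(vector_field, x1, cond, rng, sigma_min=SIGMA_MIN):
    """Mean squared error between predicted and target velocity."""
    x0 = rng.standard_normal(x1.shape)      # sample from the noise prior
    t = rng.uniform(size=x1.shape[0])       # one random time per example
    x_t, u_t = ot_cfm_pair(x1, x0, t, sigma_min)
    pred = vector_field(x_t, t, cond)       # hypothetical model call
    return float(np.mean((pred - u_t) ** 2))


def euler_sample(vector_field, cond, shape, n_steps, rng):
    """Integrate the learned ODE from noise (t=0) to data (t=1).

    Each step costs exactly one network evaluation, so n_steps is the
    number of network evaluations spent on synthesis.
    """
    x = rng.standard_normal(shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = np.full(shape[0], i * dt)
        x = x + dt * vector_field(x, t, cond)
    return x
```

In a sketch like this, `vector_field` would be a text-conditioned network, and the output of `euler_sample` would be split along the feature axis into the acoustic part (passed to a vocoder) and the pose part (used to drive the skeleton). Because the regression target `u_t` is constant along each straight path, the learned flow tends to be close to linear, which is the usual explanation for why a small `n_steps` can suffice.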