Automatic synthesis of realistic co-speech gestures is an increasingly important yet challenging task in the creation of artificial embodied agents. Previous systems mainly generate gestures in an end-to-end manner, which makes it difficult to capture clear rhythm and semantics because the harmony between speech and gestures is complex yet subtle. We present a novel co-speech gesture synthesis method that achieves convincing results in both rhythm and semantics. For rhythm, our system contains a robust rhythm-based segmentation pipeline that explicitly ensures temporal coherence between vocalization and gestures. For gesture semantics, we devise a mechanism, based on linguistic theory, that effectively disentangles low- and high-level neural embeddings of both speech and motion. The high-level embedding corresponds to semantics, while the low-level embedding relates to subtle variations. Lastly, we build a correspondence between the hierarchical embeddings of speech and motion, resulting in rhythm- and semantics-aware gesture synthesis. Evaluations with existing objective metrics, a newly proposed rhythmic metric, and human feedback show that our method outperforms state-of-the-art systems by a clear margin.
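The abstract does not specify how the rhythm-based segmentation works, so the following is only a minimal sketch of the general idea of cutting a motion sequence at speech beats so that each gesture block is temporally aligned with the audio. The function name `segment_by_beats`, its parameters, and the assumption that beat times have already been detected from the speech signal are all hypothetical and are not taken from the paper.

```python
def segment_by_beats(num_frames, fps, beat_times):
    """Split frame indices [0, num_frames) into blocks bounded by beats.

    Hypothetical helper: beat_times (in seconds) are assumed to come from
    some upstream audio beat/onset detector that is not shown here.
    """
    # Convert beat times to frame indices and drop duplicates.
    cuts = sorted({round(t * fps) for t in beat_times})
    # Keep only cuts that fall strictly inside the sequence.
    cuts = [c for c in cuts if 0 < c < num_frames]
    bounds = [0] + cuts + [num_frames]
    # Each block is a half-open frame range [start, end).
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


if __name__ == "__main__":
    # 50 motion frames at 20 fps, with beats at 0.5 s, 1.2 s, and 2.0 s.
    for start, end in segment_by_beats(50, 20, [0.5, 1.2, 2.0]):
        print(f"frames {start}..{end - 1}")
```

In a setup like this, each resulting block could be paired with the speech audio over the same time span, which is one simple way to make the temporal alignment between vocalization and gesture explicit rather than leaving it to an end-to-end model to learn.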