Because co-speech gestures play a significant role in human communication, their automatic generation in artificial embodied agents has received a lot of attention. Although modern deep learning approaches can generate realistic-looking conversational gestures from spoken language, the resulting gestures often fail to convey meaningful information or to fit the context. This paper presents an augmented approach to co-speech gesture generation that additionally takes given form and meaning features of the gestures into account. Our framework acquires this information from a small corpus with rich semantic annotations together with a larger corpus that lacks such annotations. We analyze the effects of distinctive feature targets and report on a human rater evaluation study demonstrating that our framework achieves semantic coherence and person perception on the same level as human ground truth behavior. We make our data pipeline and the generation framework publicly available.