Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures remains a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful ones. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, we inject these semantic exemplar gestures into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time, without any additional training. Further, we propose a control paradigm for guidance that allows users to modulate the influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the effectiveness of our approach against recent gesture generation approaches. We encourage the reader to explore the results on our project page.
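The inference-time retrieval guidance described above can be illustrated with a minimal toy sketch. This is not the paper's implementation: the nearest-neighbour lookup, the `denoise_step` stand-in for a trained denoiser, and the linear guidance schedule are all simplifying assumptions; only the overall pattern (retrieve an exemplar, then nudge each denoising step toward it with a user-controllable strength) reflects the idea in the abstract.

```python
import numpy as np

def retrieve_exemplar(query, database):
    # Nearest neighbour by cosine similarity -- a stand-in for the
    # paper's linguistically grounded retrieval of gesture exemplars.
    sims = database @ query / (
        np.linalg.norm(database, axis=1) * np.linalg.norm(query) + 1e-8)
    return database[np.argmax(sims)]

def denoise_step(x_t, t):
    # Toy stand-in for a trained diffusion denoiser; a real system
    # would call a neural network conditioned on speech here.
    return x_t * (1.0 - 0.1 * t)

def generate(query, database, steps=10, guidance=0.5, seed=0):
    """Sample a 'gesture' vector from noise with retrieval guidance.

    `guidance` plays the role of the user-controllable knob that
    modulates how strongly the retrieved exemplar shapes the output.
    """
    rng = np.random.default_rng(seed)
    exemplar = retrieve_exemplar(query, database)
    x = rng.standard_normal(exemplar.shape)  # start from pure noise
    for t in np.linspace(1.0, 0.0, steps):
        x = denoise_step(x, t)
        # Retrieval guidance: pull the current sample toward the
        # exemplar, with influence decaying as denoising finishes.
        x = x + guidance * t * (exemplar - x)
    return x, exemplar
```

With `guidance=0.0` the loop reduces to plain (toy) denoising, while larger values pull the sample progressively closer to the retrieved motion, mimicking the controllable influence described in the abstract.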