Co-speech gesturing is an important modality in conversation, providing context and social cues. In character animation, appropriate and synchronised gestures add realism and can make interactive agents more engaging. Historically, methods for automatically generating gestures were predominantly audio-driven, exploiting the prosodic and speech-related content encoded in the audio signal. In this paper we instead experiment with gesture generation driven by LLM features extracted from text using LLAMA2. We compare these against audio features, and explore combining the two modalities, in both objective tests and a user study. Surprisingly, our results show that LLAMA2 features on their own perform significantly better than audio features, and that including both modalities yields no significant difference compared with using LLAMA2 features in isolation. We demonstrate that the LLAMA2-based model can generate both beat and semantic gestures without any audio input, suggesting that LLMs can provide rich encodings that are well suited to gesture generation.