Co-speech gesture generation enhances the realism of human-computer interaction by synthesizing gestures synchronized with speech. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. First, we construct a comprehensive co-speech gesture ethogram and develop an LLM-based intent chain reasoning mechanism that systematically parses gesture semantics and decomposes them into structured inference steps following the ethogram criteria, effectively guiding the LLM to generate context-aware gesture labels. We then build an intent-chain-annotated text-to-gesture-label dataset and train a lightweight gesture label generation model, which in turn guides the synthesis of credible and semantically coherent co-speech gestures. Experimental results demonstrate that SARGes achieves semantically aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.
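To make the single-pass labeling step concrete, the following Python sketch illustrates one intent-chain query. Everything here is an illustrative assumption rather than the paper's actual artifacts: the `llm` callable stands in for any LLM backend, and the prompt wording and the small `ETHOGRAM_LABELS` set are placeholders for the paper's full ethogram and prompts.

```python
# Minimal sketch of one intent-chain labeling query, assuming the LLM is
# exposed as a callable `llm(prompt) -> str`. Labels and prompt text are
# illustrative placeholders, not the paper's actual ethogram or prompts.
from typing import Callable

# Hypothetical subset of ethogram labels; the paper's ethogram is richer.
ETHOGRAM_LABELS = ["beat", "point_self", "point_other", "enumerate",
                   "negate", "emphasize", "shrug", "none"]

INTENT_CHAIN_PROMPT = """\
You label co-speech gestures for a virtual agent.
Follow these steps, writing each one out before answering:
1. Summarize the speaker's communicative intent in one sentence.
2. Decide whether the utterance is referential, discursive,
   or affective (ethogram criterion).
3. Match that intent against the allowed labels: {labels}.
4. Output exactly one final line of the form LABEL: <label>.

Utterance: "{utterance}"
"""

def label_utterance(utterance: str, llm: Callable[[str], str]) -> str:
    """Run one single-pass intent-chain query and extract the label."""
    prompt = INTENT_CHAIN_PROMPT.format(
        labels=", ".join(ETHOGRAM_LABELS), utterance=utterance)
    reply = llm(prompt)
    # Take the last LABEL: line; fall back to "none" if the model strays.
    for line in reversed(reply.strip().splitlines()):
        if line.startswith("LABEL:"):
            label = line.split(":", 1)[1].strip()
            if label in ETHOGRAM_LABELS:
                return label
    return "none"
```

The single prompt-and-parse round trip reflects the abstract's single-pass inference claim; in the full system, the resulting label would condition the downstream gesture synthesis model.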