Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.