This study explores two frameworks for co-speech gesture generation, AQ-GT and its semantically-augmented variant AQ-GT-a, to evaluate their ability to convey meaning through gestures and how humans perceive the resulting movements. Using sentences from the SAGA spatial communication corpus, contextually similar sentences, and novel movement-focused sentences, we conducted a user-centered evaluation of concept recognition and human-likeness. Results revealed a nuanced relationship between semantic annotations and performance. The original AQ-GT framework, lacking explicit semantic input, was surprisingly more effective at conveying concepts within its training domain. Conversely, the AQ-GT-a framework demonstrated better generalization, particularly for representing shape and size in novel contexts. While participants rated gestures from AQ-GT-a as more expressive and helpful, they did not perceive them as more human-like. These findings suggest that explicit semantic enrichment does not guarantee improved gesture generation and that its effectiveness is highly dependent on the context, indicating a potential trade-off between specialization and generalization.