A key component of dyadic spoken interactions is contextually relevant non-verbal gestures, such as head movements that reflect a listener's response to the interlocutor's speech. Although significant progress has been made in generating co-speech gestures, generating a listener's response has remained a challenge. We introduce the task of generating a listener's continuous head-motion response to a speaker's speech in real time. To this end, we propose a graph-based end-to-end crossmodal model that takes the interlocutor's speech audio as input and directly generates the listener's head pose angles (roll, pitch, yaw) in real time. Unlike previous work, our approach is completely data-driven, requires no manual annotations, and does not oversimplify head motion to mere nods and shakes. Extensive evaluation on the dyadic interaction sessions of the IEMOCAP dataset shows that our model produces a low overall error (4.5 degrees) at a high frame rate, indicating its deployability in real-world human-robot interaction systems. Our code is available at https://github.com/bigzen/Active-Listener
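To make the described interface concrete, below is a minimal sketch of an audio-to-head-pose regressor under assumptions of our own: per-frame audio features (e.g., MFCCs) as graph nodes, a causal adjacency linking each frame to its recent past, and linear message-passing layers standing in for proper graph convolutions. The class name `AudioToHeadPose`, the helper `causal_adjacency`, and all dimensions are hypothetical; this is not the authors' architecture, only an illustration of the input/output contract (speech audio in, roll/pitch/yaw per frame out).

```python
# Hypothetical sketch of an audio-to-head-pose model: each audio frame is a
# graph node, edges connect a frame to its recent past (keeping inference
# causal for real-time use), and stacked message-passing layers regress
# (roll, pitch, yaw). Not the paper's actual implementation.
import torch
import torch.nn as nn


def causal_adjacency(num_frames: int, context: int = 4) -> torch.Tensor:
    """Row-normalized adjacency linking each frame to itself and up to
    `context` preceding frames, so no future audio is consumed."""
    adj = torch.zeros(num_frames, num_frames)
    for t in range(num_frames):
        adj[t, max(0, t - context): t + 1] = 1.0
    return adj / adj.sum(dim=1, keepdim=True)


class AudioToHeadPose(nn.Module):
    def __init__(self, audio_dim: int = 40, hidden_dim: int = 128, layers: int = 2):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim, hidden_dim)
        # Each "graph conv" here is message passing (A @ X) followed by a
        # learned projection; a real system might use torch_geometric instead.
        self.gcn = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(layers))
        self.head = nn.Linear(hidden_dim, 3)  # roll, pitch, yaw (degrees)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim), e.g. per-frame MFCCs
        _, num_frames, _ = audio_feats.shape
        adj = causal_adjacency(num_frames).to(audio_feats.device)
        x = torch.relu(self.in_proj(audio_feats))
        for conv in self.gcn:
            x = torch.relu(conv(adj @ x))  # aggregate past context, then transform
        return self.head(x)  # (batch, frames, 3) head pose angles


# Toy usage: 2 clips, 100 frames of 40-dim audio features each.
poses = AudioToHeadPose()(torch.randn(2, 100, 40))
print(poses.shape)  # torch.Size([2, 100, 3])
```

The causal adjacency is the one design choice worth noting: restricting each node's neighborhood to past frames is what would allow frame-by-frame generation at interaction time, consistent with the real-time claim in the abstract.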