Listeners use short interjections, so-called backchannels, to signify attention or express agreement. The automatic analysis of this behavior is of key importance for human conversation analysis and interactive conversational agents. Current state-of-the-art approaches for backchannel analysis from visual behavior make use of two types of features: features based on body pose and features based on facial behavior. At the same time, transformer neural networks have been established as an effective means to fuse input from different data sources, but they have not yet been applied to backchannel analysis. In this work, we conduct a comprehensive evaluation of multi-modal transformer architectures for automatic backchannel analysis based on pose and facial information. We address both the detection of backchannels and the estimation of the agreement expressed in a backchannel. In evaluations on the MultiMediate'22 backchannel detection challenge, we reach 66.4% accuracy with a one-layer transformer architecture, outperforming the previous state of the art. With a two-layer transformer architecture, we also set a new state of the art (0.0604 MSE) on the task of estimating the amount of agreement expressed in a backchannel.