Studies of human-human interaction have introduced the concept of the F-formation to describe the spatial arrangement of participants during social interactions. This paper has two objectives: detecting F-formations in video sequences and predicting the next speaker in a group conversation. The proposed approach exploits temporal information and human multimodal signals in video sequences. In particular, we measure people's engagement level as a feature of group membership. Our approach uses a recurrent neural network, the Long Short-Term Memory (LSTM), to predict who will take the next speaking turn in a conversational group. Experiments on the MatchNMingle dataset yielded 85% true positives in group detection and 98% accuracy in predicting the next speaker.