Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey

From arXiv. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version was published in ACM Computing Surveys, https://doi.org/10.1145/3626516

Automated analysis of co-located human-human interaction has relied on nonverbal communication as measurable evidence of social and psychological phenomena. We survey computing studies (since 2010) that detect phenomena related to social traits (e.g., leadership, dominance, personality traits), social roles/relations, and interaction dynamics (e.g., group cohesion, engagement, rapport). Our aim is to identify the nonverbal cues and computational methodologies that result in effective performance. This survey differs from its counterparts by covering the widest spectrum of social phenomena and interaction settings (free-standing conversations, meetings, dyads, and crowds). We also present a comprehensive summary of the related datasets and outline future research directions regarding the implementation of artificial intelligence, dataset curation, and privacy-preserving interaction analysis. Some major observations are: the most often used nonverbal cue is speaking activity, the most often used computational method is support vector machines, the most common interaction environment is meetings of 3-4 persons, and the most common sensing approach is microphones and cameras; multimodal features perform notably better; deep learning architectures showed improved performance overall, but there are many phenomena whose detection has never been implemented with deep models. We also identified several limitations, such as the lack of scalable benchmarks, annotation reliability tests, cross-dataset experiments, and explainability analysis.
