Humans effortlessly recognize social interactions from visual input, yet the underlying computations remain unknown, and social interaction recognition challenges even the most advanced deep neural networks (DNNs). Here, we hypothesized that humans rely on 3D visuospatial pose information to make social judgments, and that this information is largely absent from most vision DNNs. To test these hypotheses, we used a novel pose and depth estimation pipeline to automatically extract 3D body joint positions from short video clips. We compared how well these body joints predicted human social judgments in the videos against embeddings from over 350 vision DNNs. We found that body joints predicted social judgments better than most DNNs. We then reduced the 3D body joints to an even more compact feature set describing only the 3D position and direction of people in the videos. We found that this minimal 3D feature set, but not its 2D counterpart, was necessary and sufficient to explain the prediction performance of the full set of body joints. These minimal 3D features also predicted the extent to which DNNs aligned with human social judgments and significantly improved their performance on these tasks. Together, these findings demonstrate that human social perception depends on simple, explicit 3D pose information.
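To make the minimal feature set concrete, the sketch below reduces a set of 3D body joints to just a person's position and facing direction. The joint names (`left_hip`, `right_shoulder`, etc.), the hip-midpoint position, and the shoulder-normal facing heuristic are illustrative assumptions for exposition, not the paper's exact definitions.

```python
import numpy as np

def minimal_3d_features(joints):
    """Reduce one person's 3D body joints to (position, facing direction).

    joints: dict mapping joint name -> np.array([x, y, z]).
    Assumes y is the vertical ("up") axis; a sketch, not the paper's pipeline.
    """
    # Position: midpoint of the hips approximates the body's 3D location.
    position = (joints["left_hip"] + joints["right_hip"]) / 2.0

    # Facing direction: horizontal vector perpendicular to the shoulder line.
    shoulder_axis = joints["right_shoulder"] - joints["left_shoulder"]
    up = np.array([0.0, 1.0, 0.0])
    facing = np.cross(up, shoulder_axis)  # perpendicular to shoulders and to "up"
    facing[1] = 0.0                       # project onto the horizontal plane
    facing /= np.linalg.norm(facing)      # unit-length direction
    return position, facing
```

Applied per person per frame, this yields the compact 3D position-and-direction representation the abstract contrasts with its 2D counterpart.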