Our work focuses on the social reasoning capabilities of foundation models in real-world human-robot interactions. We introduce the Social Human Robot Embodied Conversation (SHREC) Dataset, a benchmark of $\sim$400 real-world human-robot interaction videos with over 10K annotations that capture robot social errors, competencies, underlying rationales, and corrections. Unlike prior datasets, which focus on human-human interactions, the SHREC Dataset highlights the social challenges that real-world social robots face, such as emotion understanding, intention tracking, and conversational mechanics. Current foundation models struggle to recognize these deficits, which manifest as subtle, socially situated failures. To evaluate AI models' capacity for social reasoning, we define eight benchmark tasks targeting four critical areas: (1) detection of social errors and competencies, (2) identification of underlying social attributes, (3) comprehension of interaction flow, and (4) generation of rationales and alternative correct actions. Experiments with state-of-the-art foundation models, alongside human evaluations, reveal substantial performance gaps, underscoring the difficulty of these tasks and suggesting directions for developing socially intelligent AI.