Our work focuses on the social reasoning capabilities of foundation models for real-world human-robot interactions. We introduce the Social Human Robot Embodied Conversation (SHREC) Dataset, a benchmark of $\sim$400 real-world human-robot interaction videos and over 10K annotations, capturing robot social errors, competencies, underlying rationales, and corrections. Unlike prior datasets focused on human-human interactions, the SHREC Dataset uniquely highlights the social challenges faced by real-world social robots, such as emotion understanding, intention tracking, and conversational mechanics. Moreover, current foundation models struggle to recognize these deficits, which manifest as subtle, socially situated failures. To evaluate AI models' capacity for social reasoning, we define eight benchmark tasks targeting critical areas: (1) detection of social errors and competencies, (2) identification of underlying social attributes, (3) comprehension of interaction flow, and (4) provision of rationales and alternative correct actions. Experiments with state-of-the-art foundation models, alongside human evaluations, reveal substantial performance gaps, underscoring the difficulty of these tasks and pointing to directions for developing socially intelligent AI.