Building an interactive AI assistant that can perceive, reason, and collaborate with humans in the real world has been a long-standing pursuit in the AI community. This work is part of a broader research effort to develop intelligent agents that can interactively guide humans through performing tasks in the physical world. As a first step in this direction, we introduce HoloAssist, a large-scale egocentric human interaction dataset, where two people collaboratively complete physical manipulation tasks. The task performer executes the task while wearing a mixed-reality headset that captures seven synchronized data streams. The task instructor watches the performer's egocentric video in real time and guides them verbally. By augmenting the data with action and conversational annotations and observing the rich behaviors of various participants, we present key insights into how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the environment. HoloAssist spans 166 hours of data captured by 350 unique instructor-performer pairs. Furthermore, we construct and present benchmarks on mistake detection, intervention type prediction, and hand forecasting, along with detailed analysis. We expect HoloAssist will provide an important resource for building AI assistants that can fluidly collaborate with humans in the real world. Data can be downloaded at https://holoassist.github.io/.