We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
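To make the shape of the data concrete, below is a minimal, hypothetical sketch of what one annotated record might look like, based only on the modalities and statistics named in the abstract (synchronized gaze, speech, ego/exo video, and 2,707 referential expressions over 3.67 hours). All field names and the helper function are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReferentialExpression:
    """One annotated referring expression (hypothetical schema).

    Fields mirror the modalities described in the abstract; they are
    illustrative only and do not reflect the actual dataset format.
    """
    session_id: str                 # recording session (one of 25 participants)
    start_s: float                  # utterance onset, in seconds
    end_s: float                    # utterance offset, in seconds
    transcript: str                 # spoken referring expression
    target_object: str              # annotated referent (kitchen ingredient)
    gaze_points: List[Tuple[float, float, float]]  # (t, x, y) gaze samples in the ego view
    ego_video: str                  # path to egocentric (Aria) clip
    exo_video: str                  # path to exocentric (stationary camera) clip


def expressions_per_hour(n_expressions: int, hours: float) -> float:
    """Annotation density: expressions per recorded hour."""
    return n_expressions / hours


if __name__ == "__main__":
    # From the abstract: 2,707 expressions over 3.67 hours, roughly 738 per hour.
    print(f"{expressions_per_hour(2707, 3.67):.0f} expressions/hour")
```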