Advances in deep generative modeling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse actions, and keyboard actions. Each modality is logged with millisecond precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants. Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding, and long-term memory. PLAICraft opens a path toward training and evaluating agents that act fluently and purposefully in real time, paving the way for truly embodied artificial intelligence.