Advances in deep generative modeling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse actions, and keyboard actions. Each modality is logged with millisecond precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants. Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding, and long-term memory. PLAICraft opens a path toward training and evaluating agents that act fluently and purposefully in real time, paving the way for truly embodied artificial intelligence.