Enabling agents to predict the outcomes of their own motion intentions in three-dimensional space is a fundamental problem in embodied intelligence. To explore general spatial imagination capabilities, we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences from current visual inputs and motion intentions. Specifically, we construct a dataset for aerial world model training and testing that consists of 11k video-intention pairs. The dataset comprises first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. We then develop a two-phase training schedule that turns a foundation model, initially devoid of embodied spatial knowledge, into a world model that is controllable by motion intentions and adheres to physical spatio-temporal constraints. Experimental results demonstrate that AirScape significantly outperforms existing foundation models in 3D spatial imagination, with an improvement of over 50% on metrics reflecting motion alignment. The project is available at: https://embodiedcity.github.io/AirScape/.
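To make the described interface concrete, below is a minimal Python sketch of the prediction task the abstract outlines: conditioning on current first-person-view frames and a motion intention to produce future frames. All names here (`AirScapeModel`, `predict`, the intention strings, tensor shapes) are illustrative assumptions, not the project's actual API; the body is a placeholder, not the real model.

```python
import numpy as np


class AirScapeModel:
    """Hypothetical stand-in for a world model that maps
    (current observations, motion intention) -> predicted future observations."""

    def predict(self, frames: np.ndarray, intention: str, horizon: int = 16) -> np.ndarray:
        # frames: (T, H, W, 3) first-person-view observations from the drone.
        # intention: natural-language motion intention, e.g. "ascend and yaw left".
        # Returns `horizon` predicted future frames with the same spatial size.
        # Placeholder rollout: repeat the last observation; a trained world model
        # would instead generate frames consistent with the intention and with
        # physical spatio-temporal constraints.
        return np.repeat(frames[-1:], horizon, axis=0)


# Usage example (shapes are assumptions for illustration only):
model = AirScapeModel()
current = np.zeros((4, 256, 256, 3), dtype=np.float32)
future = model.predict(current, "fly forward and ascend", horizon=16)
print(future.shape)  # (16, 256, 256, 3)
```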