Recent advances in world models have demonstrated strong capabilities in simulating physical reality, making them an increasingly important foundation for embodied intelligence. For UAV agents in particular, accurate prediction of complex 3D dynamics is essential for autonomous navigation and robust decision-making in unconstrained environments. However, under the highly dynamic camera trajectories typical of UAV views, existing world models often struggle to maintain spatiotemporal physical consistency. A key reason is the distribution bias of current training data: most existing datasets exhibit restricted 2.5D motion patterns, such as ground-constrained autonomous driving scenes or relatively smooth human-centric egocentric videos, and therefore lack realistic, highly dynamic 6-DoF UAV motion priors. To address this gap, we present MotionScape, a large-scale real-world UAV-view video dataset with highly dynamic motion for world modeling. MotionScape contains over 30 hours of 4K UAV-view videos, totaling more than 4.5M frames. The dataset provides semantically and geometrically aligned training samples, in which diverse real-world UAV videos are tightly coupled with accurate 6-DoF camera trajectories and fine-grained natural language descriptions. To build the dataset, we develop an automated multi-stage processing pipeline that integrates CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and large-language-model-driven semantic annotation. Extensive experiments show that incorporating these semantically and geometrically aligned annotations improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, thereby benefiting decision-making and planning for UAV agents in complex environments. The dataset is publicly available at https://github.com/Thelegendzz/MotionScape.
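To make the notion of a semantically and geometrically aligned sample more concrete, the following is a minimal Python sketch, not the authors' implementation. The names `CameraPose`, `AlignedSample`, and `segment_relevant_frames`, the field layout, and the threshold values are all hypothetical. The per-frame scores stand in for the output of a relevance model such as CLIP, and the function merges relevance filtering and temporal segmentation into a single pass purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class CameraPose:
    """One 6-DoF camera pose (hypothetical layout, not the dataset's format)."""
    timestamp: float                       # seconds from clip start
    position: Tuple[float, float, float]   # x, y, z translation
    rotation: Tuple[float, float, float]   # roll, pitch, yaw in radians


@dataclass
class AlignedSample:
    """A video clip paired with its trajectory and caption (hypothetical schema)."""
    video_path: str
    start_frame: int
    end_frame: int                         # exclusive
    trajectory: List[CameraPose] = field(default_factory=list)
    caption: str = ""


def segment_relevant_frames(
    scores: Sequence[float],
    threshold: float = 0.5,
    min_len: int = 2,
) -> List[Tuple[int, int]]:
    """Split a video into contiguous runs of relevant frames.

    `scores` holds one relevance score per frame (assumed to come from an
    upstream model). Returns (start, end) ranges with `end` exclusive, keeping
    only runs of at least `min_len` frames whose scores are all >= threshold.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i                      # a relevant run begins
        elif score < threshold and start is not None:
            if i - start >= min_len:       # keep the run only if long enough
                segments.append((start, i))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(scores) - start >= min_len:
        segments.append((start, len(scores)))
    return segments


if __name__ == "__main__":
    frame_scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.6, 0.6, 0.6]
    for start, end in segment_relevant_frames(frame_scores):
        sample = AlignedSample("clip.mp4", start, end, caption="placeholder caption")
        print(sample.start_frame, sample.end_frame)
```

In a fuller pipeline, each retained segment would then be passed to a SLAM system to fill `trajectory` and to a language model to fill `caption`; both steps are omitted here because the abstract does not specify the tools' interfaces.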