Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of egocentric viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleaning pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.