Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of egocentric viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleaning pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.