Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging the camera and object movements commonly observed in daily life. Since real-world 4D data are scarce in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation.