Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation.