Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To this end, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulation and meaningful interaction. However, current methods often treat different modalities, including 2D (images), video, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Moreover, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified review of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, the survey starts from 2D generation (appearance), then moves to video generation (appearance + dynamics) and 3D generation (appearance + geometry), and finally culminates in 4D generation, which integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D, and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics, and future directions, offering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.