World models play a crucial role in understanding and predicting the dynamics of the world, which is essential for video generation. However, existing world models are confined to specific scenarios such as gaming or driving, which limits their ability to capture the complexity of general, dynamic environments. We therefore introduce WorldDreamer, a pioneering world model that aims to build a comprehensive understanding of general world physics and motion, which significantly enhances video generation. Drawing inspiration from the success of large language models, WorldDreamer frames world modeling as an unsupervised visual sequence modeling problem: visual inputs are mapped to discrete tokens, and the model is trained to predict the tokens that have been masked. During this process, we incorporate multi-modal prompts to enable interaction with the world model. Our experiments show that WorldDreamer excels at generating videos across different scenarios, including natural scenes and driving environments. It is also versatile, handling tasks such as text-to-video generation, image-to-video synthesis, and video editing. These results underscore WorldDreamer's effectiveness in capturing the dynamics of diverse, general-world environments.
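The masked-token objective mentioned above can be illustrated with a minimal sketch. It is not WorldDreamer's implementation: the abstract does not specify the tokenizer, the masking schedule, or the loss, so the names `MASK_ID`, `mask_tokens`, and `masked_cross_entropy`, the fixed 50% mask ratio, and the random "logits" standing in for a model's output are all assumptions made for illustration. The sketch only shows the general idea of hiding some discrete tokens and scoring predictions at the hidden positions.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel, chosen to lie outside the codebook range


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random fraction of token IDs with MASK_ID.

    Returns the masked sequence and the sorted list of masked positions.
    """
    n_mask = round(mask_ratio * len(tokens))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_ID if i in masked_positions else t
              for i, t in enumerate(tokens)]
    return masked, sorted(masked_positions)


def masked_cross_entropy(logits, targets, positions):
    """Mean cross-entropy over the masked positions only.

    logits[i] is a list of unnormalised scores over the vocabulary for
    position i; targets[i] is the original (unmasked) token ID.
    """
    total = 0.0
    for i in positions:
        row = logits[i]
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[targets[i]]
    return total / len(positions)


rng = random.Random(0)
vocab_size = 16  # assumed codebook size
# Assumed layout: 4 frames of 8 x 8 tokens, flattened into one sequence.
tokens = [rng.randrange(vocab_size) for _ in range(4 * 8 * 8)]
masked, positions = mask_tokens(tokens, mask_ratio=0.5, rng=rng)

# Stand-in for a model's output: random scores instead of real predictions.
logits = [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in tokens]
loss = masked_cross_entropy(logits, tokens, positions)
```

In a real system the token IDs would come from a learned visual tokenizer and the logits from a trained network conditioned on the unmasked tokens and any prompts; here both are random placeholders so that the example runs on its own.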