Sora is the first large-scale generalist video generation model to garner significant attention across society. Since OpenAI launched it in February 2024, no other video generation model has matched Sora's performance or its capacity to support a broad spectrum of video generation tasks. Additionally, only a few video generation models have been fully published, and the majority are closed-source. To address this gap, this paper proposes Mora, a new multi-agent framework that incorporates several advanced visual AI agents to replicate the generalist video generation demonstrated by Sora. In particular, Mora can utilize multiple visual agents to successfully mimic Sora's video generation capabilities in various tasks, such as (1) text-to-video generation, (2) text-conditional image-to-video generation, (3) extending generated videos, (4) video-to-video editing, (5) connecting videos, and (6) simulating digital worlds. Our extensive experimental results show that Mora achieves performance close to that of Sora in various tasks. However, an obvious performance gap remains between our work and Sora when assessed holistically. In summary, we hope this project can guide the future trajectory of video generation through collaborative AI agents.