Text-to-video generation has been dominated by end-to-end diffusion-based or autoregressive models. On one hand, these novel models offer impressive versatility, but they are often criticized for poor physical correctness, shading and illumination, camera motion, and temporal consistency. On the other hand, the film industry relies on manually edited Computer-Generated Imagery (CGI) produced with 3D modeling software. Human-directed 3D synthetic videos and animations avoid the aforementioned shortcomings, but producing them is extremely tedious and requires tight collaboration between movie makers and 3D rendering experts. In this paper, we introduce an automatic synthetic video generation pipeline based on collaborating Vision Large Language Model (VLM) agents. Given a natural language description of a video, multiple VLM agents auto-direct the various stages of the generation pipeline. They cooperate to create Blender scripts that render a video best aligned with the given description. Drawing inspiration from film making and augmented with Blender-specific movie-making knowledge, the Director agent decomposes the input text-based video description into sub-processes. For each sub-process, the Programmer agent produces Python-based Blender scripts through customized function composition and API calls. The Reviewer agent, augmented with knowledge of video reviewing, character motion coordinates, and intermediate screenshots, then uses its compositional reasoning ability to provide feedback to the Programmer agent, which iteratively improves the scripts to yield the best overall video. Our generated videos outperform those of commercial video generation models on five metrics covering video quality and instruction-following performance. Moreover, our framework outperforms other approaches in a comprehensive user study on quality, consistency, and rationality.
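The Director/Programmer/Reviewer collaboration described above can be sketched as a simple control loop. This is a minimal illustration only: the agent functions below are hypothetical stubs standing in for the actual VLM calls and Blender rendering, and the names, round limit, and approve/feedback protocol are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the multi-agent pipeline: the Director decomposes
# the description, the Programmer drafts Blender scripts, and the Reviewer
# iteratively returns feedback until it approves. All three are stubs.

MAX_ROUNDS = 3  # assumed cap on revision iterations per sub-process

def director(description):
    # Decompose the text-based video description into sub-processes (stub).
    return [f"scene setup for: {description}",
            f"character animation for: {description}"]

def programmer(sub_process, feedback=None):
    # Produce a Python-based Blender (bpy) script for one sub-process (stub);
    # append a revision note when Reviewer feedback exists.
    script = f"# bpy script implementing {sub_process}"
    if feedback:
        script += f"\n# revised per feedback: {feedback}"
    return script

def reviewer(script, round_idx):
    # Inspect motion coordinates / screenshots and return (approved, feedback).
    # Stub: request one revision, then approve on the following round.
    if round_idx >= 1:
        return True, None
    return False, "fix camera path"

def generate(description):
    scripts = []
    for sub in director(description):
        feedback = None
        for i in range(MAX_ROUNDS):
            script = programmer(sub, feedback)
            approved, feedback = reviewer(script, i)
            if approved:
                break
        scripts.append(script)
    return scripts
```

With the stubbed Reviewer, each sub-process goes through exactly one revision cycle before approval, mirroring the iterative improvement loop in the pipeline.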