StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration

The advent of AI-Generated Content (AIGC) has spurred research into automated video generation to streamline conventional processes. However, automating storytelling video production, particularly for customized narratives, remains challenging due to the complexity of maintaining subject consistency across shots. While existing approaches like Mora and AesopAgent integrate multiple agents for Story-to-Video (S2V) generation, they fall short in preserving protagonist consistency and supporting Customized Storytelling Video Generation (CSVG). To address these limitations, we propose StoryAgent, a multi-agent framework designed for CSVG. StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process. Notably, our framework includes agents for story design, storyboard generation, video creation, agent coordination, and result evaluation. Leveraging the strengths of different models, StoryAgent enhances control over the generation process, significantly improving character consistency. Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency, while a novel storyboard generation pipeline is proposed to maintain subject consistency across shots. Extensive experiments demonstrate the effectiveness of our approach in synthesizing highly consistent storytelling videos, outperforming state-of-the-art methods. Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
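The decomposition described above can be pictured with a minimal Python sketch. It covers only the five roles named in the abstract (story design, storyboard generation, video creation, agent coordination, result evaluation). Every function name, the `Shot` type, the string stand-ins for images and clips, and the retry loop are assumptions made for illustration, not the authors' implementation. LoRA-BE and the actual generative models are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shot:
    """One shot of the story; fields are filled in stage by stage."""
    description: str
    storyboard: str = ""  # stand-in for a generated keyframe image
    video: str = ""       # stand-in for a generated video clip


def story_writer(prompt: str, num_shots: int = 3) -> List[Shot]:
    # Hypothetical stub: a real story-design agent would query a language model.
    return [Shot(f"{prompt} (shot {i})") for i in range(1, num_shots + 1)]


def storyboard_generator(shot: Shot, protagonist: str) -> Shot:
    # Hypothetical stub: every keyframe is tied to the same protagonist
    # reference, mimicking the goal of inter-shot subject consistency.
    shot.storyboard = f"keyframe[{protagonist}] for '{shot.description}'"
    return shot


def video_creator(shot: Shot) -> Shot:
    # Hypothetical stub: a real agent would run an image-to-video model here.
    shot.video = f"clip from {shot.storyboard}"
    return shot


def observer(shot: Shot, protagonist: str) -> bool:
    # Toy evaluation: accept the shot only if the protagonist tag survived
    # both the storyboard and the video stage.
    return protagonist in shot.storyboard and protagonist in shot.video


def agent_manager(prompt: str, protagonist: str, max_retries: int = 2) -> List[Shot]:
    # Coordination: run the stages in order and retry a shot if the
    # observer rejects it (the retry policy is an assumption).
    finished = []
    for shot in story_writer(prompt):
        for _ in range(max_retries + 1):
            shot = video_creator(storyboard_generator(shot, protagonist))
            if observer(shot, protagonist):
                break
        finished.append(shot)
    return finished
```

Calling `agent_manager("A cat explores a garden", "Tom")` returns three `Shot` objects whose storyboard and video strings all carry the same protagonist tag. That shared tag is the toy analogue of keeping the subject consistent across shots.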
