StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration

The advent of AI-Generated Content (AIGC) has spurred research into automated video generation as a way to streamline conventional production. However, automating storytelling video production, particularly for customized narratives, remains challenging because subject consistency must be maintained across shots. Existing approaches such as Mora and AesopAgent integrate multiple agents for Story-to-Video (S2V) generation, but they fall short in preserving protagonist consistency and in supporting Customized Storytelling Video Generation (CSVG). To address these limitations, we propose StoryAgent, a multi-agent framework designed for CSVG. StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process. The framework includes agents for story design, storyboard generation, video creation, agent coordination, and result evaluation. By leveraging the strengths of different models, StoryAgent enhances control over the generation process and significantly improves character consistency. Specifically, we introduce LoRA-BE, a customized Image-to-Video (I2V) method, to enhance intra-shot temporal consistency, and we propose a novel storyboard generation pipeline to maintain subject consistency across shots. Extensive experiments demonstrate the effectiveness of our approach in synthesizing highly consistent storytelling videos, outperforming state-of-the-art methods. Our contributions include StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
