Animation has attracted significant interest in the film and TV industry in recent years. Although advanced video generation models such as Sora, Kling, and CogVideoX succeed at generating natural videos, they are less effective on animation videos. Evaluating animation video generation is also challenging because of animation's unique artistic styles, violations of the laws of physics, and exaggerated motions. In this paper, we present AniSora, a comprehensive system for animation video generation that comprises a data processing pipeline, a controllable generation model, and an evaluation dataset. Supported by a data processing pipeline that yields over 10M high-quality data samples, the generation model incorporates a spatiotemporal mask module to enable key animation production functions such as image-to-video generation, frame interpolation, and localized image-guided animation. We also collect an evaluation benchmark of 948 diverse animation videos. Evaluation on VBench and a double-blind human test demonstrates consistency in character and motion, achieving state-of-the-art results in animation video generation. Our evaluation benchmark will be publicly available at https://github.com/bilibili/Index-anisora.