Animation has attracted significant interest in the film and TV industry in recent years. Although advanced video generation models such as Sora, Kling, and CogVideoX succeed at generating natural videos, they are less effective on animation videos. Evaluating animation video generation is also challenging because of animation's distinctive artistic styles, violations of physical laws, and exaggerated motions. In this paper, we present AniSora, a comprehensive system for animation video generation that comprises a data processing pipeline, a controllable generation model, and an evaluation dataset. Supported by a data processing pipeline that provides over 10M high-quality data samples, the generation model incorporates a spatiotemporal mask module to enable key animation production functions such as image-to-video generation, frame interpolation, and localized image-guided animation. We also collect an evaluation benchmark of 948 diverse animation videos. Evaluation on VBench and in human double-blind tests demonstrates consistency in character and motion, achieving state-of-the-art results in animation video generation. Our evaluation benchmark will be publicly available at https://github.com/bilibili/Index-anisora.
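The abstract does not describe how the spatiotemporal mask module is implemented. As a purely illustrative sketch, one plausible reading is a binary mask over (frame, row, column) positions that marks where guide-image content is supplied and where the model must generate. The function name `build_guidance_mask`, the task labels, and the `region` parameter below are all hypothetical and are not taken from AniSora.

```python
def build_guidance_mask(num_frames, height, width, task, region=None):
    """Return a nested-list mask of shape (T, H, W).

    1 marks a position where guide content is given; 0 marks a position
    the model would have to generate. This layout is an assumption made
    for illustration, not the paper's actual design.
    """
    mask = [[[0] * width for _ in range(height)] for _ in range(num_frames)]

    def fill(frame, top, left, bottom, right):
        # Mark a rectangular area of one frame as guided.
        for r in range(top, bottom):
            for c in range(left, right):
                mask[frame][r][c] = 1

    if task == "image_to_video":
        # Only the first frame is given.
        fill(0, 0, 0, height, width)
    elif task == "frame_interpolation":
        # First and last frames are given; the middle is generated.
        fill(0, 0, 0, height, width)
        fill(num_frames - 1, 0, 0, height, width)
    elif task == "localized":
        # Only a rectangular region of the first frame is given.
        if region is None:
            raise ValueError("localized guidance needs a (top, left, bottom, right) region")
        top, left, bottom, right = region
        fill(0, top, left, bottom, right)
    else:
        raise ValueError(f"unknown task: {task}")
    return mask


def guided_count(mask, frame):
    """Count the guided positions in one frame of the mask."""
    return sum(sum(row) for row in mask[frame])
```

Under this assumed layout, the three functions named in the abstract differ only in which positions of the mask are set, which is one way a single model could serve all of them.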