Animation has attracted significant interest in the film and TV industry in recent years. Although advanced video generation models such as Sora, Kling, and CogVideoX succeed at generating natural videos, they are less effective on animation videos. Evaluating animation video generation is also challenging because of animation's distinctive artistic styles, exaggerated motion, and departures from the laws of physics. In this paper, we present AniSora, a comprehensive system for animation video generation that comprises a data processing pipeline, a controllable generation model, and an evaluation dataset. Supported by a data processing pipeline that yields over 10M high-quality data samples, the generation model incorporates a spatiotemporal mask module to support key animation production functions such as image-to-video generation, frame interpolation, and localized image-guided animation. We also collect an evaluation benchmark of 948 diverse animation videos. Evaluation on VBench and in a double-blind human test demonstrates consistency in character and motion, achieving state-of-the-art results in animation video generation. Our model access API and evaluation benchmark will be publicly available.
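To illustrate how a single spatiotemporal mask could cover the three production functions named in the abstract, the following is a minimal sketch and not AniSora's actual implementation. The abstract does not specify the mask's shape or how the model consumes it, so the function name `build_mask`, the `(frames, height, width)` layout, and the convention that 1 marks guidance pixels and 0 marks pixels to be generated are all assumptions made for illustration. Nested lists are used to keep the example dependency-free; a real pipeline would presumably use tensors.

```python
def build_mask(num_frames, height, width, guided_frames, region=None):
    """Build a hypothetical binary mask indexed as mask[t][y][x].

    1 marks pixels supplied as guidance; 0 marks pixels the model must generate.
    `region` is an optional (top, left, bottom, right) box, end-exclusive.
    With region=None the whole frame counts as guidance.
    """
    top, left, bottom, right = region if region is not None else (0, 0, height, width)
    guided = set(guided_frames)
    return [
        [
            [1 if (t in guided and top <= y < bottom and left <= x < right) else 0
             for x in range(width)]
            for y in range(height)
        ]
        for t in range(num_frames)
    ]


def count_guided(mask):
    """Count the pixels marked as guidance across all frames."""
    return sum(value for frame in mask for row in frame for value in row)


# Image-to-video: only the first frame is given.
i2v_mask = build_mask(8, 4, 4, guided_frames=[0])

# Frame interpolation: the first and last frames are given.
interp_mask = build_mask(8, 4, 4, guided_frames=[0, 7])

# Localized image-guided animation: only a region of the first frame is given.
local_mask = build_mask(8, 4, 4, guided_frames=[0], region=(0, 0, 2, 2))
```

Under these assumptions, the three tasks differ only in which frames and regions are marked, which is one plausible reason a mask-based design can unify them in a single model.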