The growing demand for high-fidelity video generation from textual descriptions has catalyzed significant research in this field. In this work, we introduce MagicVideo-V2, which integrates a text-to-image model, a video motion generator, a reference image embedding module, and a frame interpolation module into an end-to-end video generation pipeline. Owing to this architectural design, MagicVideo-V2 generates aesthetically pleasing, high-resolution videos with remarkable fidelity and smoothness. In a large-scale user evaluation, it outperforms leading text-to-video systems such as Runway, Pika 1.0, Morph, Moon Valley, and the Stable Video Diffusion model.