Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency with high visual quality, particularly on resource-limited devices, e.g., iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to improve the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including the data processing and model training stages. Extensive experiments show that our method achieves a 31× speedup over state-of-the-art models such as VideoCrafter2 while attaining the highest overall score on VBench. Moreover, our method supports generating videos of up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with leading methods. Hummingbird offers a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
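To make the pruning claim concrete, below is a minimal, hypothetical PyTorch sketch of block-level pruning of the kind the abstract describes (shrinking a U-Net by removing whole blocks and measuring the parameter reduction). The `ToyUNetStage` module, the every-other-block selection rule, and all names here are illustrative assumptions, not the paper's actual architecture or pruning criterion, which the abstract does not specify.

```python
# Hypothetical sketch: block-level pruning of a U-Net-like stage.
# The abstract reports shrinking the U-Net from 1.4B to 0.7B parameters;
# the selection criterion is not given, so this toy example simply keeps
# every other block and reports the resulting size change.
import torch
import torch.nn as nn


def count_params(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


class ToyUNetStage(nn.Module):
    """Stand-in for one U-Net stage (not the paper's architecture)."""

    def __init__(self, dim: int = 512, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.SiLU())
            for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x


def prune_every_other_block(stage: ToyUNetStage) -> ToyUNetStage:
    """Keep blocks at even indices. A real method would rank blocks by an
    importance score before dropping them; the criterion is an assumption."""
    stage.blocks = nn.ModuleList(
        b for i, b in enumerate(stage.blocks) if i % 2 == 0
    )
    return stage


stage = ToyUNetStage()
before = count_params(stage)
prune_every_other_block(stage)
after = count_params(stage)
print(f"params: {before:,} -> {after:,} ({after / before:.0%} kept)")
```

Dropping half the blocks roughly halves the parameter count, mirroring the abstract's 1.4B-to-0.7B reduction; the paper's released training code would be the authoritative reference for how blocks are actually selected and how visual feedback learning recovers quality afterward.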