Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities beyond those of the T2V foundation model. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch comprising TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, we carefully design a comprehensive data pipeline, covering audio-video captioning, quality control, and related stages, to collect high-quality finetuning data. Additionally, we introduce a new benchmark for comprehensive model evaluation and comparison. After continual pretraining and finetuning on million-scale high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.
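The abstract only names TA-CrossAttn and UniTemp-RoPE without specifying their implementation, so the following is a minimal PyTorch sketch of one plausible reading of the two ideas: a rotary embedding indexed by absolute time (so audio and video tokens that co-occur in time share the same rotary phase) and a cross-attention layer in which audio queries attend to video keys/values under that shared temporal encoding. All module names, tensor shapes, token rates, and the exact rotary formulation below are our assumptions for illustration, not ALIVE's actual implementation.

```python
# Hypothetical sketch of "UniTemp-RoPE" + "TA-CrossAttn"; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def unitemp_rope(x: torch.Tensor, t_sec: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary embedding indexed by absolute time in seconds rather than token
    position, so audio and video tokens at the same timestamp get the same
    rotary phase (our reading of the 'UniTemp-RoPE' idea).
    x: (B, H, N, D) with D even; t_sec: (N,) timestamp of each token."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, device=x.device).float() / d)  # (D/2,)
    angles = t_sec[:, None] * freqs[None, :]                               # (N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


class TACrossAttn(nn.Module):
    """Temporally-aligned cross-attention: audio queries attend to video
    keys/values, with both sides rotated by the shared time-indexed RoPE so
    attention is biased toward temporally co-located content."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.h, self.dh = n_heads, dim // n_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, audio, video, t_audio, t_video):
        # audio: (B, Na, C), video: (B, Nv, C); t_*: (N,) timestamps in seconds
        B, Na, C = audio.shape
        q = self.q(audio).view(B, Na, self.h, self.dh).transpose(1, 2)
        k, v = self.kv(video).chunk(2, dim=-1)
        k = k.view(B, -1, self.h, self.dh).transpose(1, 2)
        v = v.view(B, -1, self.h, self.dh).transpose(1, 2)
        q = unitemp_rope(q, t_audio)  # shared clock across both modalities
        k = unitemp_rope(k, t_video)
        o = F.scaled_dot_product_attention(q, k, v)
        return self.out(o.transpose(1, 2).reshape(B, Na, C))


# Usage with assumed token rates (24 video tokens/s, 50 audio tokens/s):
B, dim = 1, 512
video = torch.randn(B, 48, dim)    # 2 s of video tokens
audio = torch.randn(B, 100, dim)   # 2 s of audio tokens
t_v = torch.arange(48) / 24.0
t_a = torch.arange(100) / 50.0
fused = TACrossAttn(dim)(audio, video, t_a, t_v)  # (1, 100, 512)
```

The key design point this sketch tries to capture is that temporal alignment comes from the positional encoding itself: because both modalities are rotated on a common time axis, attention scores naturally peak for audio-video token pairs with matching timestamps, without any hand-crafted alignment mask.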