We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which can subsequently be upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community train high-quality video generation models more efficiently and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.