We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drifting heuristics such as self-forcing, error banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV-cache, sparse/linear attention, or quantization; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) tasks. To mitigate drifting in long-video generation, we characterize typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to, or lower than, those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.