We present LongLive-2.0, an NVFP4-based parallel infrastructure that spans the full training and inference workflow of long video generation and addresses its speed and memory bottlenecks. For training, we introduce sequence-parallel (SP) autoregressive (AR) training, instantiated as Balanced SP, which co-designs an efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask together with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that high-quality infrastructure and a high-quality dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods, which rely on ODE initialization followed by distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive AR diffusion model. The model can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize the KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed achieved on Blackwell GPUs, while the quantized KV cache can lower the inter-GPU communication of SP. Experiments show speedups of up to 2.15x in training and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
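To make the teacher-forcing layout described above more concrete, the following is a minimal sketch of a chunk-level attention mask over clean-history and noisy-target chunks, plus a toy assignment of chunk pairs to SP ranks. It is an illustration under stated assumptions, not the LongLive-2.0 implementation: the token ordering (all clean chunks followed by all noisy chunks), the rule that a noisy chunk attends to strictly earlier clean chunks and to itself, and the helper names `teacher_forcing_mask` and `paired_layout` are all hypothetical. The abstract does not specify how Balanced SP actually balances work across ranks.

```python
import numpy as np


def teacher_forcing_mask(num_chunks: int, chunk_len: int) -> np.ndarray:
    """Return a boolean mask of shape [2*N*L, 2*N*L]; True means "may attend".

    Assumed token layout: [clean_0 .. clean_{N-1}, noisy_0 .. noisy_{N-1}],
    each chunk holding chunk_len tokens.
    """
    idx = np.arange(num_chunks)
    q = idx[:, None]  # query chunk index
    k = idx[None, :]  # key chunk index

    clean_to_clean = k <= q  # causal attention over clean history
    clean_to_noisy = np.zeros((num_chunks, num_chunks), dtype=bool)  # never
    noisy_to_clean = k < q   # a noisy target sees only earlier clean chunks
    noisy_to_noisy = k == q  # and the tokens of its own chunk

    chunk_mask = np.block([
        [clean_to_clean, clean_to_noisy],
        [noisy_to_clean, noisy_to_noisy],
    ])
    # Expand the chunk-level mask to token level.
    token_mask = np.repeat(chunk_mask, chunk_len, axis=0)
    return np.repeat(token_mask, chunk_len, axis=1)


def paired_layout(num_chunks: int, world_size: int) -> list[list[int]]:
    """Assign chunk indices to ranks; rank r holds clean_i and noisy_i for each
    index i in its list, so every rank carries the same number of tokens.
    (Hypothetical contiguous split, assumed for illustration only.)
    """
    if num_chunks % world_size != 0:
        raise ValueError("num_chunks must be divisible by world_size")
    per_rank = num_chunks // world_size
    return [list(range(r * per_rank, (r + 1) * per_rank))
            for r in range(world_size)]


if __name__ == "__main__":
    mask = teacher_forcing_mask(num_chunks=3, chunk_len=2)
    print(mask.astype(int))
    print(paired_layout(num_chunks=4, world_size=2))
```

In this sketch, pairing `clean_i` with `noisy_i` on the same rank keeps the per-rank token count equal. The attention cost per rank can still differ, because later chunks attend to more history, and the sketch does not try to model how the actual system handles that.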