Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio, thanks to their temporally uni-directional attention mechanism, which models correlations between the current token and previous tokens. However, video streaming remains much less explored, despite a growing need for live video processing. State-of-the-art video diffusion models leverage bi-directional temporal attention to model the correlations between the current frame and all the surrounding (i.e., including future) frames, which hinders them from processing streaming videos. To address this problem, we present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation. Compared to previous works, our approach ensures temporal consistency and smoothness by correlating the current frame with its predecessors and a few initial warmup frames, without any future frames. Additionally, we use a highly efficient denoising scheme featuring a KV-cache mechanism and pipelining to facilitate streaming video translation at interactive frame rates. Extensive experiments demonstrate the effectiveness of the proposed attention mechanism and pipeline, outperforming previous methods in terms of temporal smoothness and/or efficiency.
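To make the attention design concrete, the following is a minimal NumPy sketch (not the paper's implementation) of uni-directional temporal attention with a KV cache: each incoming frame appends its key/value to the cache and attends only over the cached history, and the mask additionally exposes a few warmup frames to every later frame. The class and function names (`StreamingAttention`, `causal_warmup_mask`) are illustrative assumptions, not identifiers from Live2Diff.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_warmup_mask(t, num_warmup):
    """Boolean (t, t) mask: frame i may attend to frames <= i (uni-directional,
    no future frames) and, additionally, to the first `num_warmup` frames."""
    mask = np.tril(np.ones((t, t), dtype=bool))  # causal part
    mask[:, :num_warmup] = True                  # warmup frames visible to all
    return mask

class StreamingAttention:
    """Toy single-head attention over a growing KV cache: at each step the new
    frame's key/value are appended, and the query attends over the whole cache
    (i.e. only past and current frames -- never future ones)."""
    def __init__(self, dim):
        self.dim = dim
        self.k_cache = []
        self.v_cache = []

    def step(self, q, k, v):
        self.k_cache.append(k)
        self.v_cache.append(v)
        K = np.stack(self.k_cache)                 # (t, dim)
        V = np.stack(self.v_cache)                 # (t, dim)
        scores = (K @ q) / np.sqrt(self.dim)       # (t,)
        return softmax(scores) @ V                 # weighted sum over history
```

Because the cache only ever grows with past frames, the streaming step-by-step output is identical to running full attention with a causal mask over the whole clip, which is the property that lets such a model denoise live video without waiting for future frames.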