This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods that process a limited number of frames in batches, we process frames in a streaming fashion to support unlimited frames. At the heart of StreamV2V lies a backward-looking principle that relates the present to the past. This is realized by maintaining a feature bank, which archives information from past frames. For each incoming frame, StreamV2V extends self-attention to include banked keys and values and directly fuses similar past features into the output. The feature bank is continually updated by merging stored and new features, keeping it compact yet informative. StreamV2V stands out for its adaptability and efficiency, integrating seamlessly with image diffusion models without fine-tuning. It runs at 20 FPS on a single A100 GPU, which is 15x, 46x, 108x, and 158x faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative metrics and user studies confirm StreamV2V's exceptional ability to maintain temporal consistency.
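To illustrate the core mechanism described above, here is a minimal sketch of self-attention extended with banked keys and values. The function name and tensor shapes are illustrative assumptions, not the paper's actual implementation; the bank-merging policy is likewise simplified.

```python
import torch
import torch.nn.functional as F

def banked_self_attention(q, k, v, bank_k, bank_v):
    """Extended self-attention (illustrative sketch): the current frame's
    queries attend over its own keys/values concatenated with banked
    keys/values archived from past frames."""
    # (B, N + M, D): current tokens plus M banked tokens from the past
    k_ext = torch.cat([k, bank_k], dim=1)
    v_ext = torch.cat([v, bank_v], dim=1)
    scale = q.shape[-1] ** 0.5
    attn = torch.softmax(q @ k_ext.transpose(-2, -1) / scale, dim=-1)
    return attn @ v_ext  # output keeps the current frame's token count

def update_bank(bank, new_feats, max_size=64):
    """Hypothetical bank update: append new features, then keep the most
    recent max_size tokens so the bank stays compact."""
    merged = torch.cat([bank, new_feats], dim=1)
    return merged[:, -max_size:]
```

In this sketch the output retains the current frame's token count while its attention context grows with the bank, which is how past information can influence the present frame without reprocessing earlier frames.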