StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as the Metaverse, live video streaming, and broadcasting, where high throughput is imperative. To address this, we present Stream Batch, a novel approach that transforms the original sequential denoising into a batched denoising process. Stream Batch eliminates the conventional wait-and-interact approach and enables fluid, high-throughput streams. To handle the frequency disparity between data input and model throughput, we design a novel input-output queue for parallelizing the streaming process. Moreover, the existing diffusion pipeline uses classifier-free guidance (CFG), which requires additional U-Net computation. To mitigate the redundant computations, we propose a novel residual classifier-free guidance (RCFG) algorithm that reduces the number of negative conditional denoising steps to only one or even zero. In addition, we introduce a stochastic similarity filter (SSF) to optimize power consumption. Stream Batch achieves around a 1.5x speedup over the sequential denoising method at different denoising levels. The proposed RCFG runs up to 2.05x faster than conventional CFG. Combining the proposed strategies with existing mature acceleration tools, image-to-image generation reaches up to 91.07 fps on one RTX 4090, improving the throughput of the AutoPipeline developed by Diffusers by more than 59.56x. Furthermore, StreamDiffusion significantly reduces energy consumption, by 2.39x on one RTX 3060 and by 1.99x on one RTX 4090.
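To make the idea of batched denoising concrete, the following is a minimal toy sketch, not the StreamDiffusion implementation. It assumes that frames at different denoising stages can be advanced together by one batched model call, so that after a warm-up period each call emits one finished frame. The names `ToyStreamBatch`, `toy_denoise_batch`, and `push` are hypothetical, and the "denoiser" is a stand-in that simply adds 1.0 per step where a real pipeline would run a U-Net.

```python
from collections import deque


def toy_denoise_batch(latents):
    """Stand-in for ONE batched model call (hypothetical, not a real U-Net).

    Every latent in the batch is advanced by one denoising step; here a
    'step' just adds 1.0 so the result is easy to check by hand.
    """
    return [x + 1.0 for x in latents]


class ToyStreamBatch:
    """Toy pipeline holding frames that sit at staggered denoising steps."""

    def __init__(self, num_steps):
        self.num_steps = num_steps
        self.slots = deque()  # each entry is [latent, steps_done]
        self.calls = 0        # number of batched model calls so far

    def push(self, frame):
        """Admit one new frame, run one batched call, return a finished frame or None."""
        self.slots.append([frame, 0])
        advanced = toy_denoise_batch([slot[0] for slot in self.slots])
        self.calls += 1
        for slot, new_latent in zip(self.slots, advanced):
            slot[0] = new_latent
            slot[1] += 1
        # The oldest frame is the only one that can have finished.
        if self.slots[0][1] == self.num_steps:
            return self.slots.popleft()[0]
        return None


if __name__ == "__main__":
    pipeline = ToyStreamBatch(num_steps=4)
    outputs = [pipeline.push(float(f)) for f in (0, 10, 20, 30, 40, 50)]
    print(outputs)         # [None, None, None, 4.0, 14.0, 24.0]
    print(pipeline.calls)  # 6
```

In this toy, six frames need six batched calls, with the first three calls serving as warm-up. A sequential loop with four steps per frame would need 24 single-frame calls for the same six frames. Each batched call processes more latents, so the real speedup depends on how well the hardware absorbs the larger batch.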
