Text-to-image generation using diffusion models has seen explosive popularity owing to their ability to produce high-quality images that adhere to text prompts. However, production-grade diffusion model serving is a resource-intensive task that not only requires expensive high-end GPUs but also incurs considerable latency. In this paper, we introduce a technique called approximate-caching that reduces the number of iterative denoising steps needed to generate an image from a prompt by reusing intermediate noise states created during a prior image generation for a similar prompt. Based on this idea, we present an end-to-end text-to-image system, Nirvana, that uses approximate-caching with a novel cache management policy, Least Computationally Beneficial and Frequently Used (LCBFU), to provide % GPU compute savings, 19.8% end-to-end latency reduction, and 19% dollar savings, on average, on two real production workloads. We further present an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity, and reuse of intermediate states in a large production environment.
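The caching idea described in the abstract can be illustrated with a small sketch. It is not the Nirvana implementation: the abstract does not specify how prompt similarity is measured, how many steps may be skipped, or how LCBFU scores entries. The sketch assumes cosine similarity over prompt embeddings, a hypothetical threshold table `SKIP_TABLE`, a fixed budget of `TOTAL_STEPS` denoising steps, and an eviction score of hit count times skipped steps as one plausible reading of "least computationally beneficial and frequently used". All class and function names are illustrative.

```python
import math
from dataclasses import dataclass
from typing import Any, List, Optional

TOTAL_STEPS = 50  # assumed number of denoising steps for a full generation

# Hypothetical table: (minimum similarity, steps that may safely be skipped).
SKIP_TABLE = [(0.95, 25), (0.85, 15), (0.75, 5)]


@dataclass
class CacheEntry:
    embedding: List[float]  # embedding of the prompt that produced the state
    step: int               # denoising steps already completed when saved
    state: Any              # intermediate noise state (opaque in this sketch)
    hits: int = 0           # how often this entry has been reused


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def allowed_skip(similarity: float) -> int:
    """Largest number of steps the assumed table lets us skip."""
    for min_sim, k in SKIP_TABLE:
        if similarity >= min_sim:
            return k
    return 0


class ApproximateCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[CacheEntry] = []

    def lookup(self, query: List[float]) -> Optional[CacheEntry]:
        """Return the usable entry that skips the most steps, or None."""
        best: Optional[CacheEntry] = None
        for entry in self.entries:
            limit = allowed_skip(cosine(query, entry.embedding))
            if entry.step <= limit and (best is None or entry.step > best.step):
                best = entry
        if best is not None:
            best.hits += 1
        return best

    def insert(self, entry: CacheEntry) -> None:
        """Add an entry, evicting the lowest hits * step score when full."""
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda e: e.hits * e.step)
            self.entries.remove(victim)
        self.entries.append(entry)


def steps_remaining(cache: ApproximateCache, query: List[float]) -> int:
    """Denoising steps still needed after reusing a cached state, if any."""
    entry = cache.lookup(query)
    return TOTAL_STEPS - (entry.step if entry else 0)
```

On a cache miss the full `TOTAL_STEPS` are run; on a hit, generation would resume from the cached state at step `entry.step`. Scoring eviction by `hits * step` keeps entries that are both popular and save many steps, which is why a rarely used entry that skips few steps is evicted first in this sketch.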