Driven by the recent success of diffusion models, text-to-image generation has become increasingly popular and has found a wide range of applications. Among these, text-to-image editing, or continuous text-to-image generation, has attracted considerable attention and can potentially improve the quality of generated images. Users commonly want to adjust a generated image slightly by making minor modifications to the input text description over several rounds of diffusion inference. However, this editing process suffers from the low inference efficiency of many existing diffusion models, even on GPU accelerators. To address this problem, we introduce Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for efficient text-to-image editing. The key intuition behind our approach is to exploit the semantic mapping between minor modifications to the input text and the affected regions of the output image. For each text editing step, FISEdit automatically identifies the affected image regions and reuses the cached feature maps of the unchanged regions to accelerate inference. Extensive empirical results show that FISEdit can be $3.4\times$ and $4.4\times$ faster than existing methods on NVIDIA TITAN RTX and A100 GPUs, respectively, and even generates more satisfactory images.
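The cache-and-recompute idea described above can be illustrated with a minimal NumPy sketch. This is not FISEdit's implementation: the function names (`affected_mask`, `sparse_update`), the tile size, and the threshold are hypothetical, and a simple difference threshold stands in for the semantic mapping from text changes to image regions. The sketch also assumes the layer is pointwise, so recomputing a tile in isolation matches the dense result; a real convolution or attention layer would need neighboring context.

```python
import numpy as np


def affected_mask(prev_latent, new_latent, threshold=0.1, tile=8):
    """Flag tiles whose mean absolute change exceeds `threshold`.

    Hypothetical stand-in for identifying affected regions. Inputs have
    shape (C, H, W), with H and W divisible by `tile`.
    """
    diff = np.abs(new_latent - prev_latent).mean(axis=0)  # (H, W)
    h, w = diff.shape
    # Average the per-pixel change inside each tile.
    tiles = diff.reshape(h // tile, tile, w // tile, tile).mean(axis=(1, 3))
    return tiles > threshold  # boolean array of shape (H/tile, W/tile)


def sparse_update(cached_features, new_input, mask, layer_fn, tile=8):
    """Recompute `layer_fn` only on flagged tiles and reuse the cache elsewhere.

    Assumes `layer_fn` is pointwise and preserves shape (an assumption of
    this sketch, not a property of real diffusion layers).
    """
    out = cached_features.copy()
    for ty, tx in zip(*np.nonzero(mask)):
        ys = slice(ty * tile, (ty + 1) * tile)
        xs = slice(tx * tile, (tx + 1) * tile)
        out[:, ys, xs] = layer_fn(new_input[:, ys, xs])
    return out
```

With a pointwise `layer_fn`, the output of `sparse_update` equals a full dense recomputation, while the work done scales with the number of flagged tiles rather than with the image size.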