Despite significant advances in image customization with diffusion models, current methods still have several limitations: 1) regenerating the entire image causes unintended changes in non-target areas; 2) guidance comes from either a reference image or a text description alone, not both; and 3) fine-tuning is time-consuming, which limits practical application. In response, we introduce a tuning-free framework for simultaneous text-image-guided image customization that enables precise editing of specific image regions within seconds. Our approach preserves the semantic features of the reference image subject while allowing its detailed attributes to be modified according to text descriptions. To achieve this, we propose an attention blending strategy that blends self-attention features in the UNet decoder during the denoising process. To our knowledge, this is the first tuning-free method that concurrently uses text and image guidance for image customization in specific regions. Our approach outperforms previous methods in both human and quantitative evaluations, providing an efficient solution for practical applications such as image synthesis, design, and creative photography.
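To make the idea of blending self-attention features within a specific region more concrete, the following is a minimal sketch and not the method's actual implementation. The abstract does not specify the blending rule, so everything here is an assumption for illustration: the function name `blend_self_attention`, the binary `region_mask`, the `weight` parameter, and the linear interpolation itself are all hypothetical.

```python
import numpy as np


def blend_self_attention(target_feats, reference_feats, region_mask, weight=1.0):
    """Blend self-attention features inside a masked region (illustrative only).

    target_feats, reference_feats: arrays of shape (tokens, channels)
    region_mask: array of shape (tokens,), 1 inside the edited region, 0 outside
    weight: hypothetical blending strength in [0, 1] (an assumption, not from the paper)
    """
    if target_feats.shape != reference_feats.shape:
        raise ValueError("feature maps must have the same shape")
    # Broadcast the per-token mask over the channel dimension.
    mask = region_mask.astype(target_feats.dtype)[:, None] * weight
    # Inside the region, pull features toward the reference; outside, keep the target.
    return mask * reference_feats + (1.0 - mask) * target_feats


# Toy data standing in for one decoder layer's self-attention features.
tokens, channels = 16, 8
rng = np.random.default_rng(0)
target = rng.normal(size=(tokens, channels))
reference = rng.normal(size=(tokens, channels))

mask = np.zeros(tokens)
mask[4:8] = 1  # tokens 4..7 form the hypothetical edited region

blended = blend_self_attention(target, reference, mask)
```

In this sketch, tokens outside the mask keep the target features unchanged, which mirrors the stated goal of avoiding unintended changes in non-target areas. In a real pipeline, such a blend would presumably be applied at selected UNet decoder layers and denoising steps, but which layers and steps is not stated in the abstract.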