Recent image harmonization methods have demonstrated promising results. However, because they rely heavily on large numbers of composite images, these methods are expensive to train and often fail to generalize to unseen images. In this paper, we draw lessons from human behavior and propose a zero-shot image harmonization method. Specifically, when harmonizing an image, a person mainly relies on a long-term prior of what harmonious images look like and adjusts the composite image toward that prior. To imitate this, we use pretrained generative models as the prior over natural images. To guide the direction of harmonization, we propose an Attention-Constraint Text, which is optimized to describe the image environments accurately. Further designs are introduced to preserve the structure of the foreground content. The resulting framework, which closely mirrors human behavior, achieves harmonious results without burdensome training. Extensive experiments demonstrate the effectiveness of our approach, and we also explore some interesting applications.