Large-scale text-to-image models pre-trained on massive text-image pairs have recently shown excellent performance in image synthesis. However, images can convey visual concepts more intuitively than plain text. This raises a question: how can we integrate a desired visual concept into an existing image, such as our own portrait? Current methods fall short of this demand because they cannot preserve content or translate visual concepts effectively. Motivated by this, we propose a novel framework named visual concept translator (VCT), which preserves the content of the source image and translates visual concepts guided by a single reference image. VCT consists of a content-concept inversion (CCI) process that extracts contents and concepts, and a content-concept fusion (CCF) process that combines the extracted information to obtain the target image. Given only one reference image, VCT can complete a wide range of general image-to-image translation tasks with excellent results. Extensive experiments demonstrate the superiority and effectiveness of the proposed method. Code is available at https://github.com/CrystalNeuro/visual-concept-translator.