Textual Inversion, a prompt learning method, learns a single embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying and integrating multiple object-level concepts within one scene poses significant challenges even when embeddings for the individual concepts are attainable, as our empirical tests confirm. To address this challenge, we introduce a framework for Multi-Concept Prompt Learning (MCPL), in which multiple new "words" are learned simultaneously from a single sentence-image pair. To enhance the accuracy of word-concept correlation, we propose three regularisation techniques: Attention Masking (AttnMask), to concentrate learning on relevant areas; Prompts Contrastive Loss (PromptCL), to separate the embeddings of different concepts; and Bind adjective (Bind adj.), to associate new "words" with known words. We evaluate our method through image generation, editing, and attention visualisation on diverse images. Extensive quantitative comparisons demonstrate that our method learns more semantically disentangled concepts with enhanced word-concept correlation. Additionally, we introduce a novel dataset and evaluation protocol tailored to this new task of learning object-level concepts.
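To make two of the regularisers above more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that AttnMask can be approximated by restricting a reconstruction loss to masked positions, and that PromptCL can be approximated by an InfoNCE-style contrastive loss over prompt embeddings; the abstract specifies neither form. The function names `masked_mse` and `prompt_contrastive_loss`, the `temperature` parameter, and the toy data are all hypothetical.

```python
import math


def masked_mse(pred, target, mask):
    """Mean squared error over positions where mask is 1 (assumed AttnMask form).

    pred, target, and mask are flat lists of equal length; mask holds 0/1 values.
    """
    num = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    den = sum(mask)
    return num / den if den else 0.0


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def prompt_contrastive_loss(embeddings, labels, temperature=0.1):
    """InfoNCE-style loss (assumed PromptCL form).

    Embeddings that share a concept label are pulled together, and embeddings
    of different concepts are pushed apart. Assumes at least one pair of
    embeddings shares a label.
    """
    losses = []
    n = len(embeddings)
    for i in range(n):
        # Scaled similarity of anchor i to every other embedding.
        logits = {
            j: _cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in logits if labels[j] == labels[i]]
        if not positives:
            continue  # anchor has no same-concept partner
        # Numerically stable log-sum-exp over all non-self pairs.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        losses.append(-sum(logits[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses)


# Toy data: two concepts, two embedding samples each (hypothetical values).
labels = [0, 0, 1, 1]
disentangled = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
entangled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [0.01, 1.0]]
loss_disentangled = prompt_contrastive_loss(disentangled, labels)
loss_entangled = prompt_contrastive_loss(entangled, labels)
```

Under these assumptions, the embeddings that cluster by concept receive a lower contrastive loss than the ones that mix concepts, which is the behaviour a separation regulariser of this kind is meant to reward.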