Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models

Diffusion models have revolutionized generative content creation, and text-to-image (T2I) diffusion models in particular have increased the creative freedom of users by allowing scene synthesis using natural language. T2I models excel at synthesizing concepts such as nouns, appearances, and styles. To enable customized content creation based on a few example images of a concept, methods such as Textual Inversion and DreamBooth invert the desired concept and enable synthesizing it in new scenes. However, inverting more general concepts that go beyond object appearance and style (adjectives and verbs) through natural language remains a challenge. Two key characteristics of these concepts contribute to the limitations of current inversion methods: 1) adjectives and verbs are entangled with nouns (the subject), which can hinder appearance-based inversion methods because the subject appearance leaks into the concept embedding, and 2) describing such concepts often extends beyond single word embeddings (being frozen in ice, walking on a tightrope, etc.), which current methods do not handle. In this study, we introduce Lego, a textual inversion method designed to invert subject-entangled concepts from a few example images. Lego disentangles concepts from their associated subjects using a simple yet effective Subject Separation step and employs a Context Loss that guides the inversion of single/multi-embedding concepts. In a thorough user study, Lego-generated concepts were preferred over 70% of the time when compared to the baseline. Additionally, visual question answering using a large language model suggested Lego-generated concepts are better aligned with the text description of the concept.
