Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts, a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes, and its sub-objects all share the same cross-attention map. Additionally, to address potential confusion among main objects in complex textual prompts, we propose end token substitution as a complementary strategy. To further refine our approach in the initial stages of T2I generation, where layouts are determined, we incorporate two auxiliary losses, an entropy loss and a semantic binding loss, to iteratively update the composite token and improve generation integrity. We conducted extensive experiments to validate the effectiveness of ToMe, comparing it against various existing methods on T2I-CompBench and our proposed GPT-4o object binding benchmark. Our method is particularly effective in complex scenarios involving multiple objects and attributes, which previous methods often fail to handle. The code will be publicly available at \url{https://github.com/hutaihang/ToMe}.
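To illustrate the core idea of token merging, the following is a minimal sketch: the embeddings of semantically related prompt tokens (e.g. an object and its attribute) are aggregated into one composite token, so that downstream cross-attention treats them as a single unit. The `merge_tokens` helper, the averaging rule, and the toy embeddings are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def merge_tokens(embeddings: np.ndarray, indices: list[int]) -> np.ndarray:
    """Merge the token embeddings at `indices` (e.g. an object and its
    attribute) into a single composite token placed at the earliest of
    those positions. Averaging is a simplifying assumption; the actual
    ToMe aggregation may differ."""
    indices = sorted(indices)
    composite = embeddings[indices].mean(axis=0)  # one composite token
    kept = []
    for i in range(len(embeddings)):
        if i == indices[0]:
            kept.append(composite)        # composite replaces the group
        elif i not in indices:
            kept.append(embeddings[i])    # unrelated tokens pass through
    return np.stack(kept)

# Toy prompt of 5 tokens with embedding dim 4; merge tokens 2 and 3
# (say, "red" and "car") so they share one cross-attention map.
emb = np.arange(20, dtype=float).reshape(5, 4)
merged = merge_tokens(emb, [2, 3])
```

After merging, the sequence shrinks from 5 to 4 tokens, and position 2 holds the mean of the original attribute and object embeddings.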