Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models

Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts, particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts that guide the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first establish the global composition and then refine details precisely. To bind attributes accurately to their corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.
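To make the idea of a sequence of simplified sub-prompts concrete, the following is a minimal sketch of one way such a sequence could be built: start from the bare subjects (layout only) and restore one attribute per stage. The function name `progressive_sub_prompts`, the `(attribute, noun)` input format, and the one-attribute-per-stage ordering are assumptions made for illustration; the abstract does not specify how Detail++ performs the decomposition.

```python
def progressive_sub_prompts(subjects):
    """Build prompts that start with bare subjects and add one attribute per stage.

    `subjects` is a list of (attribute, noun) pairs. This input format is a
    hypothetical choice for this sketch, not one taken from the paper.
    """
    def render(active):
        # Join the noun phrases, including only the attributes enabled so far.
        phrases = []
        for k, (attribute, noun) in enumerate(subjects):
            phrases.append(f"a {attribute} {noun}" if k < active else f"a {noun}")
        return " and ".join(phrases)

    # Stage 0 carries no attributes (composition only);
    # stage k restores the attributes of the first k subjects.
    return [render(active) for active in range(len(subjects) + 1)]


if __name__ == "__main__":
    stages = progressive_sub_prompts([("red", "dog"), ("blue", "cat")])
    for stage, prompt in enumerate(stages):
        print(stage, prompt)
```

In this toy version the first prompt would fix the layout and each later prompt would inject one more attribute. A real system would also need to parse free-form prompts and handle articles and styles, which this sketch does not attempt.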
