Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts, particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts that guide the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first establish the global composition and then refine details precisely. To bind attributes accurately to their corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.
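To make the idea of decomposing a prompt into a sequence of simplified sub-prompts concrete, the following is a minimal sketch and not the authors' implementation. It assumes the prompt has already been parsed into subjects with attribute lists (the `Subject` format and the function names `render` and `progressive_subprompts` are hypothetical), and it assumes that each stage adds exactly one attribute, starting from a prompt with no attributes. How Detail++ actually parses prompts and orders the stages is not specified in the abstract.

```python
from typing import List, Tuple

# Hypothetical input format (an assumption, not from the paper):
# each subject is a noun plus an ordered list of attribute words.
Subject = Tuple[str, List[str]]


def article(phrase: str) -> str:
    """Pick 'a' or 'an' from the first letter of the phrase (simple heuristic)."""
    return "an" if phrase[0].lower() in "aeiou" else "a"


def render(subjects: List[Subject], active: List[int]) -> str:
    """Build a prompt in which subject i shows only its first active[i] attributes."""
    parts = []
    for (noun, attrs), k in zip(subjects, active):
        phrase = " ".join(attrs[:k] + [noun])
        parts.append(f"{article(phrase)} {phrase}")
    return " and ".join(parts)


def progressive_subprompts(subjects: List[Subject]) -> List[str]:
    """Return sub-prompts ordered from the bare layout prompt to the full prompt.

    Stage 0 names the subjects only; every later stage adds one attribute.
    """
    active = [0] * len(subjects)
    prompts = [render(subjects, active)]
    for i, (_, attrs) in enumerate(subjects):
        for _ in attrs:
            active[i] += 1
            prompts.append(render(subjects, active))
    return prompts


if __name__ == "__main__":
    example = [("dog", ["red"]), ("cat", ["blue"])]
    for stage, prompt in enumerate(progressive_subprompts(example)):
        print(stage, prompt)
```

In this sketch the first sub-prompt carries only the subjects, which matches the abstract's description of settling the global composition before details are injected; each later sub-prompt differs from its predecessor by a single attribute, so a staged generator would have one binding to resolve per stage.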