Iterative Refinement Improves Compositional Image Generation

Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require handling multiple objects, relations, and attributes simultaneously. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing the number of denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied at once. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model acting as an in-the-loop critic. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category), and a 12.5% improvement on Visual Jenga scene decomposition, compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections; human evaluators preferred our method 58.7% of the time, versus 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/
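
The contrast between critic-guided iterative refinement and compute-matched parallel sampling can be illustrated with a minimal sketch. This is not the authors' implementation: the `Generator` and `Critic` interfaces, the `Critique` record, the stopping rule, and the choice to fold critic feedback back into the prompt text are all assumptions made for illustration, and a plain string stands in for an image. Both functions spend the same budget of generator calls, so the only difference is whether later calls can see feedback on earlier ones.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Critique:
    """Hypothetical critic output (not an interface defined by the paper)."""
    score: float    # fraction of prompt constraints judged satisfied, in [0, 1]
    feedback: str   # natural-language description of what still needs fixing


# Hypothetical interfaces: a str stands in for an image in this sketch.
Generator = Callable[[str], str]          # prompt -> image
Critic = Callable[[str, str], Critique]   # (original prompt, image) -> critique


def iterative_refine(prompt: str, generate: Generator, critique: Critic,
                     budget: int) -> Tuple[str, float]:
    """Spend up to `budget` generator calls sequentially, feeding the
    critic's feedback into each subsequent call (assumed mechanism)."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    best_image: Optional[str] = None
    best_score = float("-inf")
    current_prompt = prompt
    for _ in range(budget):
        image = generate(current_prompt)
        result = critique(prompt, image)  # always judge against the original prompt
        if result.score > best_score:
            best_image, best_score = image, result.score
        if result.score >= 1.0:           # every constraint satisfied: stop early
            break
        # Assumption: the correction is expressed as extra prompt text.
        current_prompt = f"{prompt}\nFix: {result.feedback}"
    assert best_image is not None
    return best_image, best_score


def parallel_best_of_n(prompt: str, generate: Generator, critique: Critic,
                       budget: int) -> Tuple[str, float]:
    """Baseline: draw `budget` independent samples and keep the one the
    critic (used here only as a verifier) scores highest."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    scored = [(critique(prompt, image).score, image)
              for image in (generate(prompt) for _ in range(budget))]
    best_score, best_image = max(scored, key=lambda pair: pair[0])
    return best_image, best_score
```

In this sketch the critic plays two roles: in the parallel baseline it only ranks finished samples, while in the iterative loop its feedback also shapes the next generation. A real system would replace the string stand-ins with calls to an actual image generator and vision-language model.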
