Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios in which inter-object relationships frequently lead to such failures: Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%–43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.
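To make the embedding-modification pattern concrete, below is a minimal sketch (not the authors' code) of the general mechanism DOS builds on: encoding a prompt with a CLIP text encoder, editing the resulting embeddings, and feeding them to a diffusion pipeline via precomputed embeddings. The `separate` function here is a hypothetical placeholder for a directional-separation step; the abstract does not specify which three embedding types DOS modifies or how, and the model checkpoint and token indices below are illustrative assumptions.

```python
# Sketch: modify CLIP text embeddings before passing them to a T2I model.
# The separation step is a hypothetical stand-in, NOT the DOS algorithm.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a dog"

# Encode the prompt with the pipeline's own CLIP tokenizer and text encoder.
inputs = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
).to("cuda")
with torch.no_grad():
    prompt_embeds = pipe.text_encoder(inputs.input_ids)[0]  # (1, 77, 768)

def separate(embeds, idx_a, idx_b, alpha=0.5):
    """Hypothetical placeholder: push two object-token embeddings apart
    along the direction connecting them, to reduce object mixing."""
    direction = embeds[:, idx_a] - embeds[:, idx_b]
    direction = direction / direction.norm(dim=-1, keepdim=True)
    embeds = embeds.clone()
    embeds[:, idx_a] += alpha * direction
    embeds[:, idx_b] -= alpha * direction
    return embeds

# Token positions of "cat" and "dog" after the BOS token; in practice these
# would be located via the tokenizer rather than hard-coded.
modified = separate(prompt_embeds, idx_a=2, idx_b=5)

# Diffusers accepts precomputed text embeddings through `prompt_embeds`,
# bypassing the pipeline's internal prompt encoding.
image = pipe(prompt_embeds=modified).images[0]
image.save("cat_and_dog.png")
```

The key design point this sketch illustrates is that the intervention happens entirely in text-embedding space, before any denoising step, so it requires no retraining of the T2I model itself.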