Despite the stunning ability of recent text-to-image models to generate high-quality images, current approaches often struggle to compose objects with different attributes and relationships into a complex and coherent scene. We propose T2I-CompBench, a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional text prompts from 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions). We further propose several evaluation metrics specifically designed for compositional text-to-image generation. We also introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS), to boost the compositional generation abilities of pretrained text-to-image models. We conduct extensive experiments to benchmark previous methods on T2I-CompBench and to validate the effectiveness of our proposed evaluation metrics and the GORS approach. The project page is available at https://karine-h.github.io/T2I-CompBench/.