Text-to-image (T2I) synthesis has advanced significantly in recent years. However, T2I models still struggle with compositionality, the ability to create new combinations from known components. We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models. The benchmark includes 11K complex, high-quality contrastive sentence pairs spanning 20 categories. Because the two sentences in each pair differ only subtly, they enable fine-grained evaluation of T2I synthesis models. Additionally, to address the inconsistency across different evaluation metrics, we propose a strategy that uses the contrastive sentence pairs to assess the reliability of those metrics. Winoground-T2I thus serves a dual purpose: evaluating the performance of T2I models and evaluating the metrics used to assess them. Finally, we provide insights into the strengths and weaknesses of these metrics and into the capabilities of current T2I models across a range of complex compositional categories. Our benchmark is publicly available at https://github.com/zhuxiangru/Winoground-T2I.
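As a rough illustration of how contrastive sentence pairs could be used to probe a metric's reliability, the sketch below counts how often a metric prefers the matching text for each image in a pair. This is only one plausible reading of the idea, not the procedure used in the paper: `ContrastivePair`, `pair_consistency`, and both toy metrics are hypothetical names, and each "image" is stood in for by a caption string so the example runs without any model.

```python
from typing import Callable, Iterable, NamedTuple


class ContrastivePair(NamedTuple):
    """Hypothetical container: two subtly different texts and their images."""
    text_a: str
    image_a: str  # stand-in for an image generated from text_a
    text_b: str
    image_b: str  # stand-in for an image generated from text_b


def pair_consistency(metric: Callable[[str, str], float],
                     pairs: Iterable[ContrastivePair]) -> float:
    """Fraction of pairs where the metric strictly prefers the matching text
    for both images (an assumed reliability criterion)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one contrastive pair")
    hits = 0
    for p in pairs:
        a_ok = metric(p.text_a, p.image_a) > metric(p.text_b, p.image_a)
        b_ok = metric(p.text_b, p.image_b) > metric(p.text_a, p.image_b)
        hits += int(a_ok and b_ok)
    return hits / len(pairs)


def bag_of_words_metric(text: str, image: str) -> float:
    """Toy metric that ignores word order (Jaccard overlap of word sets)."""
    t, i = set(text.split()), set(image.split())
    return len(t & i) / len(t | i)


def exact_match_metric(text: str, image: str) -> float:
    """Toy metric that is sensitive to word order."""
    return 1.0 if text == image else 0.0


pairs = [
    ContrastivePair(
        text_a="a dog chasing a cat", image_a="a dog chasing a cat",
        text_b="a cat chasing a dog", image_b="a cat chasing a dog",
    )
]

bow_score = pair_consistency(bag_of_words_metric, pairs)
exact_score = pair_consistency(exact_match_metric, pairs)
print(bow_score, exact_score)
```

The order-insensitive toy metric cannot separate the two sentences, so it never passes the check, while the order-sensitive one does. A real study would replace the caption strings with generated images and the toy functions with actual text-image alignment metrics.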