Text-to-image (T2I) synthesis has advanced significantly in recent years. However, challenges remain in compositionality, the ability of a model to create new combinations from known components. We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models. The benchmark comprises 11K complex, high-quality contrastive sentence pairs spanning 20 categories. Because the two sentences in each pair differ only subtly, they enable fine-grained evaluation of T2I synthesis models. Additionally, to address inconsistency across evaluation metrics, we propose a strategy that uses these contrastive sentence pairs to assess the reliability of the metrics themselves. Winoground-T2I thus serves a dual purpose: evaluating the performance of T2I models and evaluating the metrics used to assess them. Finally, we provide insights into the strengths and weaknesses of these metrics and into the capabilities of current T2I models across a range of complex compositional categories. Our benchmark is publicly available at https://github.com/zhuxiangru/Winoground-T2I .