Accurate interpretation and visualization of human instructions are crucial for text-to-image (T2I) synthesis. However, current models struggle to capture semantic variations caused by word order changes, and existing evaluations, which rely on indirect metrics such as text-image similarity, fail to assess these challenges reliably. The focus on frequent word combinations often obscures poor performance on complex or uncommon linguistic patterns. To address these deficiencies, we propose a novel metric called SemVarEffect and a benchmark named SemVarBench, designed to evaluate the causality between semantic variations in inputs and outputs in T2I synthesis. Semantic variations are achieved through two types of linguistic permutations, while easily predictable literal variations are avoided. Experiments reveal that CogView-3-Plus and Ideogram 2 perform best, achieving a score of 0.2/1. Semantic variations in object relations are understood less well than those in attributes, scoring 0.07/1 compared to 0.17-0.19/1. We find that cross-modal alignment in UNet or Transformers plays a crucial role in handling semantic variations, a factor previously overlooked by the focus on textual encoders. Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
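To make the evaluation idea concrete, the following is a minimal sketch of how a SemVarEffect-style score could be computed for one prompt pair. It is illustrative only and not the paper's exact formula: the helpers `generate` (the T2I model under test) and `align` (a text-image alignment scorer) are hypothetical callables supplied by the evaluator, and the averaging scheme is an assumption based on the description above.

```python
def sem_var_effect(prompt_a: str, prompt_b: str, generate, align) -> float:
    """Estimate how strongly a semantic variation in the input prompt
    causes a corresponding change in the generated image (illustrative
    sketch, not the paper's exact definition).

    prompt_a and prompt_b share the same words but differ in meaning via
    a linguistic permutation, e.g. "a dog chasing a cat" vs.
    "a cat chasing a dog".
    """
    # Generate one image per prompt with the model under test.
    img_a = generate(prompt_a)
    img_b = generate(prompt_b)

    # Each image should match its own prompt better than the permuted
    # one. If the model ignores word order, both differences collapse
    # toward 0; if the variation is fully reflected, they approach 1.
    effect_a = align(prompt_a, img_a) - align(prompt_a, img_b)
    effect_b = align(prompt_b, img_b) - align(prompt_b, img_a)

    # Average the two directional effects into a single score in [-1, 1].
    return (effect_a + effect_b) / 2
```

Under this reading, a benchmark score of 0.2/1 means that, on average, only a fifth of the induced semantic variation is causally reflected in the generated images.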