The literature on text-to-image generation is replete with failures to faithfully compose entities with relations, yet there is no formal understanding of how entity-relation compositions can be effectively learned. Moreover, the underlying phenomenon space that meaningfully reflects the problem structure remains ill-defined, fueling an arms race for ever-larger quantities of data in the hope that generalization emerges from large-scale pretraining. We hypothesize that phenomenological coverage has not been scaled up proportionally with data size, skewing the presented phenomena and harming generalization. We introduce statistical metrics that quantify both the linguistic and the visual skew of a dataset for relational learning, and show that generalization failures in text-to-image generation are a direct result of incomplete or unbalanced phenomenological coverage. We first run experiments in a synthetic domain and demonstrate that these systematically controlled metrics are strongly predictive of generalization performance. We then move to natural images and show that simple distributional perturbations, guided by our theory, boost generalization without enlarging the absolute data size. This work points to an important direction: improving data diversity and balance, orthogonal to scaling up absolute size. Our discussion raises important open questions on 1) evaluating generated entity-relation compositions, and 2) building better models for reasoning with abstract relations.
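One way to make the notion of linguistic skew concrete is an entropy-based balance score over the (subject, relation, object) triples appearing in captions. The sketch below is a hypothetical illustration under our own assumptions (the function name `skew` and the normalized-entropy-deficit formulation are ours, not a definition taken from the paper); it returns 0 for a perfectly balanced distribution and approaches 1 as one configuration dominates.

```python
from collections import Counter
import math

def skew(counts: Counter) -> float:
    """Normalized entropy deficit of a count distribution.

    0.0 = perfectly balanced, -> 1.0 = maximally skewed.
    Hypothetical metric for illustration only.
    """
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    h_max = math.log(len(counts))
    return 1.0 - h / h_max if h_max > 0 else 0.0

# Toy caption dataset as (subject, relation, object) triples:
# "left of" and the (dog, cat) ordering are over-represented.
triples = [
    ("dog", "left of", "cat"), ("dog", "left of", "cat"),
    ("dog", "left of", "cat"), ("cat", "right of", "dog"),
]

# Skew of the relation vocabulary and of the entity-pair orderings.
relation_skew = skew(Counter(r for _, r, _ in triples))
pair_skew = skew(Counter((s, o) for s, _, o in triples))
print(round(relation_skew, 3), round(pair_skew, 3))  # → 0.189 0.189
```

An analogous score could be computed on the visual side, e.g. over binned spatial configurations of entity bounding boxes, giving the "visual skew" counterpart.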