Training Priors Predict Text-To-Image Model Performance

Text-to-image models can often generate some relations, i.e., "astronaut riding horse", but fail to generate other relations composed of the same basic parts, i.e., "horse riding astronaut". These failures are often taken as evidence that models rely on training priors rather than constructing novel images compositionally. This paper tests this intuition on the stablediffusion 2.1 text-to-image model. By looking at the subject-verb-object (SVO) triads that underlie these prompts (e.g., "astronaut", "ride", "horse"), we find that the more often an SVO triad appears in the training data, the better the model can generate an image aligned with that triad. Here, by aligned we mean that each of the terms appears in the generated image in the proper relation to each other. Surprisingly, this increased frequency also diminishes how well the model can generate an image aligned with the flipped triad. For example, if "astronaut riding horse" appears frequently in the training data, the image for "horse riding astronaut" will tend to be poorly aligned. Our results thus show that current models are biased to generate images with relations seen in training, and provide new data to the ongoing debate on whether these text-to-image models employ abstract compositional structure in a traditional sense, or rather, interpolate between relations explicitly seen in the training data.

翻译：文生图模型通常能够生成某些关系，例如“宇航员骑马”，但无法生成由相同基本元素构成的其他关系，例如“马骑宇航员”。这些失败常被视为模型依赖训练先验而非通过组合方式构建新颖图像的证据。本文通过stablediffusion 2.1文生图模型检验了这一直觉。通过分析这些提示词所隐含的主-谓-宾（SVO）三元组（例如“宇航员”、“骑”、“马”），我们发现SVO三元组在训练数据中出现得越频繁，模型生成与该三元组对齐的图像效果就越好。此处“对齐”指生成图像中每个术语以正确关系相互呈现。令人惊讶的是，这种频率增加还会削弱模型生成与翻转三元组对齐图像的能力。例如，若“宇航员骑马”在训练数据中频繁出现，则“马骑宇航员”的图像往往对齐效果较差。因此，我们的结果表明当前模型倾向于生成训练中已见关系的图像，并为正在进行的关于这些文生图模型是否采用传统意义上的抽象组合结构，抑或是训练数据显式见到的关系之间插值的辩论提供了新数据。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日