Text-to-image models can often generate some relations, e.g., "astronaut riding horse", but fail to generate other relations composed of the same basic parts, e.g., "horse riding astronaut". These failures are often taken as evidence that models rely on training priors rather than constructing novel images compositionally. This paper tests this intuition on the Stable Diffusion 2.1 text-to-image model. By looking at the subject-verb-object (SVO) triads that underlie these prompts (e.g., "astronaut", "ride", "horse"), we find that the more often an SVO triad appears in the training data, the better the model can generate an image aligned with that triad. Here, by "aligned" we mean that each of the terms appears in the generated image in the proper relation to the others. Surprisingly, this increased frequency also diminishes how well the model can generate an image aligned with the flipped triad. For example, if "astronaut riding horse" appears frequently in the training data, the image for "horse riding astronaut" will tend to be poorly aligned. Our results thus show that current models are biased to generate images with relations seen in training, and contribute new data to the ongoing debate on whether these text-to-image models employ abstract compositional structure in a traditional sense or instead interpolate between relations explicitly seen in the training data.
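To make the notion of a triad and its flipped counterpart concrete, the following is a minimal sketch of how one might count SVO triads and look up the frequency of a triad alongside its flip. It assumes the triads have already been extracted from training captions by some parser; the names `Triad`, `count_triads`, and `frequency_pair`, as well as the toy data, are hypothetical illustrations and not the pipeline used in this paper.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class Triad(NamedTuple):
    """A subject-verb-object triad (hypothetical helper type)."""
    subject: str
    verb: str
    obj: str

    def flipped(self) -> "Triad":
        # Swap subject and object while keeping the verb,
        # e.g. (astronaut, ride, horse) -> (horse, ride, astronaut).
        return Triad(self.obj, self.verb, self.subject)


def count_triads(triads: Iterable[Triad]) -> Counter:
    # Count how often each triad occurs in the (assumed pre-extracted) data.
    return Counter(triads)


def frequency_pair(counts: Counter, triad: Triad) -> Tuple[int, int]:
    # Return (count of the triad, count of its flipped version).
    # Counter returns 0 for triads that never occur.
    return counts[triad], counts[triad.flipped()]


# Toy data standing in for triads extracted from training captions.
toy_triads = (
    [Triad("astronaut", "ride", "horse")] * 3
    + [Triad("horse", "ride", "astronaut")]
    + [Triad("dog", "chase", "cat")] * 2
)
counts = count_triads(toy_triads)

print(frequency_pair(counts, Triad("astronaut", "ride", "horse")))
```

Under this framing, the finding above relates the first count to alignment on the original prompt, and the same count to reduced alignment on the flipped prompt.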