Text-to-image (T2I) models are increasingly used by people worldwide. However, prior research has shown that T2I models are highly sensitive to the input language: when given prompts in languages other than English (i.e., different surface forms of the same prompt), they often produce culturally stereotypical depictions, prioritizing the surface form over the prompt's semantics. Yet a comprehensive analysis of this behavior, which we dub Surface-over-Semantics (SoS), is missing. We present the first analysis of T2I models' SoS tendencies. To this end, we create a set of prompts covering 171 cultural identities, translated into 14 languages, and use it to prompt seven T2I models. To quantify SoS tendencies across models, languages, and cultures, we introduce a novel measure, and we analyze how the tendencies we identify manifest visually. We show that all but one model exhibit a strong surface-level tendency in at least two languages, and that this effect intensifies across the layers of T2I text encoders. Moreover, these surface tendencies frequently correlate with stereotypical visual depictions.