Models for text-to-image synthesis, such as DALL-E~2 and Stable Diffusion, have recently drawn considerable interest from academia and the general public. Conditioned on textual descriptions, these models can produce high-quality images that depict a wide variety of concepts and styles. However, they also absorb cultural characteristics associated with specific Unicode scripts from their vast training data, a behavior that may not be immediately apparent. We show that simply inserting a single non-Latin character into a textual description causes common models to reflect cultural stereotypes and biases in their generated images. We analyze this behavior both qualitatively and quantitatively, and identify a model's text encoder as the root cause of the phenomenon. Additionally, malicious users or service providers may try to intentionally bias the image generation to create racist stereotypes by replacing Latin characters with similar-looking characters from non-Latin scripts, so-called homoglyphs. To mitigate such unnoticed script attacks, we propose a novel homoglyph unlearning method that fine-tunes a text encoder to make it robust against homoglyph manipulations.
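To make the notion of a homoglyph manipulation concrete, the following minimal sketch replaces a single Latin character in a prompt with a visually similar non-Latin character and then flags such characters. It is an illustration only, not the method proposed in this work: the mapping `HOMOGLYPHS` and the helpers `replace_first`, `script_of`, and `non_latin_letters` are hypothetical names introduced here, and the script of a character is approximated by the first word of its Unicode character name.

```python
import unicodedata

# Hypothetical, illustrative mapping from Latin letters to look-alike
# characters from other scripts. This is an assumption for the example,
# not the character set studied in the paper.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u03bf",  # GREEK SMALL LETTER OMICRON
}


def replace_first(prompt: str, latin_char: str) -> str:
    """Replace the first occurrence of latin_char with its homoglyph."""
    return prompt.replace(latin_char, HOMOGLYPHS[latin_char], 1)


def script_of(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def non_latin_letters(prompt: str) -> list:
    """Return (index, character, Unicode name) for each non-Latin letter."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(prompt)
        if ch.isalpha() and script_of(ch) != "LATIN"
    ]


prompt = "A photo of an actress"
manipulated = replace_first(prompt, "o")

# The two strings render almost identically but differ in one code point.
print(prompt == manipulated)
print(non_latin_letters(manipulated))
```

A check of this kind only detects the manipulation at the input level; it does not change how a text encoder responds to such characters, which is what the proposed homoglyph unlearning targets.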