Text-To-Image (TTI) models, such as DALL-E and StableDiffusion, have demonstrated remarkable prompt-based image generation capabilities. Because language is a conduit of culture, multilingual encoders may have a substantial impact on the cultural agency of these models. In this study, we explore the cultural perception embedded in TTI models by characterizing culture across three hierarchical tiers: cultural dimensions, cultural domains, and cultural concepts. Based on this ontology, we derive prompt templates to unlock the cultural knowledge in TTI models, and we propose a comprehensive suite of techniques for evaluating the cultural content of TTI-generated images: intrinsic evaluations using the CLIP space, extrinsic evaluations with a Visual Question Answering (VQA) model, and human assessments. To support our research, we introduce the CulText2I dataset, derived from four diverse TTI models and spanning ten languages. Our experiments provide insights into the Do, What, Which, and How research questions about the nature of cultural encoding in TTI models, paving the way for cross-cultural applications of these models.
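As a rough illustration of how prompt templates might be derived from such an ontology, and of the kind of similarity measure an intrinsic evaluation in an embedding space could rely on, consider the minimal sketch below. Everything specific in it is an assumption made for illustration: the template wording, the example concepts and cultures, and the function names `build_prompts` and `cosine_similarity` are hypothetical and are not taken from the paper. The vectors in the usage lines are toy values, not CLIP embeddings.

```python
import math
from itertools import product

# Hypothetical template wording; the paper's actual templates are not given here.
TEMPLATE = "a photo of {concept} in {culture} culture"


def build_prompts(concepts, cultures, template=TEMPLATE):
    """Fill the template with every (concept, culture) pair, concept-major order."""
    return [
        template.format(concept=concept, culture=culture)
        for concept, culture in product(concepts, cultures)
    ]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Illustrative inputs only: these concepts and cultures are assumptions.
prompts = build_prompts(["a wedding", "a meal"], ["Japanese", "German"])

# Toy stand-ins for an image embedding and a text embedding.
score = cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])
```

In an actual intrinsic evaluation, the toy vectors would be replaced by embeddings of the generated image and of a cultural descriptor produced by a CLIP encoder; the sketch only shows the comparison step.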