Text-to-Image (TTI) models, such as DALL-E and Stable Diffusion, have demonstrated remarkable prompt-based image generation capabilities. Because language is a conduit of culture, multilingual encoders may have a substantial impact on the cultural agency of these models. In this study, we explore the cultural perception embedded in TTI models by characterizing culture across three hierarchical tiers: cultural dimensions, cultural domains, and cultural concepts. Based on this ontology, we derive prompt templates to unlock the cultural knowledge in TTI models, and propose a comprehensive suite of evaluation techniques for assessing the cultural content of TTI-generated images, including intrinsic evaluations in the CLIP embedding space, extrinsic evaluations with a Visual Question Answering (VQA) model, and human assessments. To support our research, we introduce the CulText2I dataset, derived from six diverse TTI models and spanning ten languages. Our experiments provide insights into the Do, What, Which, and How research questions about the nature of cultural encoding in TTI models, paving the way for cross-cultural applications of these models.