Text-to-image (TTI) models, exemplified by DALL-E and Stable Diffusion, have recently gained prominence for their remarkable zero-shot ability to generate images from textual prompts. Language, as a conduit of culture, plays a pivotal role in these models' multilingual capabilities, which in turn shape their cultural agency. In this study, we explore the cultural perception embedded in TTI models by characterizing culture across three hierarchical tiers: cultural dimensions, cultural domains, and cultural concepts. We propose a comprehensive suite of evaluation techniques for discerning TTI cultural perceptions: intrinsic evaluations in the CLIP space, extrinsic evaluations with a Visual Question Answering (VQA) model, and human assessments. To facilitate our research, we introduce the CulText2I dataset, derived from four diverse TTI models and spanning ten languages. Our experiments offer insights into these models' cultural awareness, the cultural distinctions they draw, and how their cultural features can be unlocked, opening up potential for cross-cultural applications.