Accurate representation in media has been shown to improve the well-being of the people who consume it. By contrast, inaccurate representations can negatively affect viewers and lead to harmful perceptions of other cultures. To achieve inclusive representation in generated images, we propose a culturally-aware priming approach for text-to-image synthesis that uses a small but culturally curated dataset we collected, the Cross-Cultural Understanding Benchmark (CCUB) dataset, to counter the bias prevalent in large-scale datasets. Our approach comprises two fine-tuning techniques: (1) adding visual context by fine-tuning a pre-trained text-to-image synthesis model, Stable Diffusion, on the CCUB text-image pairs, and (2) adding semantic context via automated prompt engineering with a large language model, GPT-3, fine-tuned on our culturally-aware CCUB text data. The CCUB dataset is curated, and our approach is evaluated, by people who have a personal relationship with the culture in question. Our experiments indicate that priming with both text and image is effective in improving the cultural relevance and decreasing the offensiveness of generated images while maintaining quality.