Text-to-image generation models have achieved strong performance in culturally homogeneous settings, yet their ability to generate multicultural scenes, where people and landmarks originate from different cultures, remains largely unexplored. We introduce multicultural text-to-image generation as a new task and present the first benchmark designed to study this setting. Our dataset contains 9,000 images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages. Using this benchmark, we analyze the behavior of state-of-the-art text-to-image models across multiple dimensions, including alignment, image quality, aesthetics, knowledge, and fairness. As one strategy for composing cultural and demographic information, we explore MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas. Our analysis shows that richer prompt composition can improve image quality and cultural grounding compared to simple prompts, while revealing substantial disparities across languages and demographic groups. We release our dataset and code at https://github.com/AIM-SCU/MosAIG.