This paper investigates how two popular text-to-image (T2I) models, DALL-E 3 and Gemini 3 Pro Preview, depict people of 206 nationalities when prompted to generate images of individuals engaging in common everyday activities. Five activity scenarios were developed, and 2,060 images were generated from input prompts that specified a nationality and one of the five activities. Aggregated across activities and models, 28.4% of the images depicted individuals wearing traditional attire, including, in several cases, attire that is impractical for the specified activity. This pattern showed a statistically significant association with region, with the Middle East & North Africa and Sub-Saharan Africa disproportionately affected, and was also associated with World Bank income group. Similar region- and income-linked patterns were observed for images labeled as depicting impractical attire in the two athletics-related activities. To assess image-text alignment, CLIP, ALIGN, and GPT-4.1 mini were used to score 9,270 image-prompt pairs. Images labeled as featuring traditional attire received statistically significantly higher alignment scores when prompts included country names, and this pattern weakened or reversed when country names were removed. Analysis of the revised prompts showed that one model frequently inserted the word "traditional" (50.3% for images labeled as traditional attire vs. 16.6% otherwise). These results indicate that such representational patterns can be shaped by several components of the pipeline, including the image generator, the evaluation models, and prompt revision.
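The abstract reports a statistically significant association between region and traditional attire but does not name the test used. As a minimal sketch, assuming a chi-square test of independence on a region-by-attire contingency table (one common choice for this kind of categorical comparison), the statistic and an effect size could be computed as follows. The function name, region labels, and counts are hypothetical and are not the paper's data.

```python
from math import sqrt


def chi_square_independence(table):
    """Return (chi2, dof, cramers_v) for an r x c table of observed counts.

    Assumes every row total and column total is positive, so that all
    expected counts are non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    n_rows, n_cols = len(table), len(table[0])
    dof = (n_rows - 1) * (n_cols - 1)
    # Cramer's V rescales chi2 to the range [0, 1] as an effect size.
    cramers_v = sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return chi2, dof, cramers_v


# Hypothetical counts, NOT taken from the paper.
# Rows: "Region A", "Region B"; columns: [traditional attire, other attire].
hypothetical_counts = [
    [10, 20],
    [20, 10],
]

chi2, dof, v = chi_square_independence(hypothetical_counts)
print(f"chi2={chi2:.3f}, dof={dof}, Cramer's V={v:.3f}")
```

The sketch stops at the statistic and effect size. Obtaining a p-value would require the chi-square distribution, which `scipy.stats.chi2_contingency` provides if SciPy is available.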