Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across $20$ common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for $48.0\%$ of samples, while South American and African countries are severely under-represented with only $1.8\%$ and $3.8\%$ of images, respectively. We observe a strong correlation between a country's GDP and its representation in the data ($\rho = 0.82$). Examining non-English subsets for $4$ languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.
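To make the two headline statistics concrete, the following is a minimal sketch of how per-country shares and a GDP correlation could be computed once each caption has been mapped to a country. It is not the authors' pipeline: the LLM extraction step is omitted, the function names (`country_shares`, `spearman`) are hypothetical, the inputs are made-up toy values, and it assumes $\rho$ denotes a Spearman rank correlation, which the abstract does not state.

```python
from collections import Counter


def country_shares(countries):
    """Return each country's fraction of the labelled samples."""
    counts = Counter(countries)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def _ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Toy, made-up inputs purely for illustration (not data from the study).
toy_labels = ["US", "US", "US", "GB", "GB", "IN", "BR", "NG"]
toy_gdp = {"US": 5.0, "GB": 3.0, "IN": 4.0, "BR": 2.0, "NG": 1.0}  # arbitrary units

shares = country_shares(toy_labels)
countries = sorted(shares)
rho = spearman([toy_gdp[c] for c in countries], [shares[c] for c in countries])
```

Ranking before correlating makes the statistic insensitive to the heavy skew in both GDP and sample counts; a Pearson correlation on raw values would be dominated by a few large countries.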