Recent progress in generative models has resulted in models that produce images that are both realistic and relevant for most textual inputs. These models are being used to generate millions of images every day, and hold the potential to drastically impact areas such as generative art, digital marketing, and data augmentation. Given their outsized impact, it is important to ensure that the generated content reflects artifacts and surroundings from across the globe, rather than over-representing certain parts of the world. In this paper, we measure the geographical representativeness of images of common nouns (e.g., a house) generated by the DALL·E 2 and Stable Diffusion models, using a crowdsourced study comprising 540 participants across 27 countries. For deliberately underspecified inputs without country names, the generated images most closely reflect the surroundings of the United States, followed by India, and the top generations rarely reflect the surroundings of the remaining countries (average score less than 3 out of 5). Specifying the country name in the input increases the representativeness by 1.44 points on average for DALL·E 2 and by 0.75 points for Stable Diffusion; however, the overall scores for many countries still remain low, highlighting the need for future models to be more geographically inclusive. Lastly, we examine the feasibility of quantifying the geographical representativeness of generated images without conducting user studies.