As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, their societal and cultural implications are drawing growing attention. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines a reproducible evaluation protocol with HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods. We evaluate generated imagery along three key aspects: (1) Implicit Stylistic Associations: examining the default visual styles that models associate with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery: TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By offering a reproducible benchmark for historical representation in generated imagery, this work takes an initial step toward building more historically accurate TTI models.