Generative artificial intelligence (GenAI) text-to-image systems are increasingly used to produce architectural imagery, yet their capacity to render accurate images in a historically rule-bound field remains poorly characterized. We evaluated five widely used GenAI image platforms (Adobe Firefly, DALL-E 3, Google Imagen 3, Microsoft Image Generator, and Midjourney) using 30 architectural prompts spanning styles, typologies, and codified elements. Each prompt-platform pair produced four images (5 platforms × 30 prompts × 4 images; n = 600 images total). Two architectural historians independently scored each image for accuracy against predefined criteria, resolving disagreements by consensus. Set-level performance was summarized as the number of accurate images (zero to four) in each four-image set. Images generated from Common prompts were 2.7-fold more accurate than those from Rare prompts (p < 0.05). Across platforms, overall accuracy was limited (highest 52 percent; lowest 32 percent; mean 42 percent). Rates of all-correct (4 out of 4) outcomes were similar across platforms. By contrast, all-incorrect (0 out of 4) outcomes varied substantially, with Imagen 3 producing the fewest and Microsoft Image Generator the most. Qualitative review of the image dataset identified recurring patterns, including over-embellishment, confusion between medieval styles and their later revivals, and misrepresentation of descriptive prompts (for example, egg-and-dart, banded column, pendentive). These findings support the need for visible labeling of GenAI synthetic content, provenance standards for future training datasets, and cautious educational use of GenAI architectural imagery.
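The set-level summary described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names (`total_images`, `set_score`, `summarize`) and the example labels are hypothetical, introduced here only to show how per-image accuracy judgments might roll up into 0-to-4 set scores, all-correct and all-incorrect counts, and an overall image-level accuracy.

```python
from collections import Counter

IMAGES_PER_SET = 4  # each prompt-platform pair yields four images (from the abstract)


def total_images(n_platforms: int, n_prompts: int, images_per_set: int) -> int:
    """Total images in a fully crossed platform x prompt design."""
    return n_platforms * n_prompts * images_per_set


def set_score(image_labels: list[bool]) -> int:
    """Count accurate images (0..4) in one four-image set."""
    if len(image_labels) != IMAGES_PER_SET:
        raise ValueError(f"expected {IMAGES_PER_SET} labels, got {len(image_labels)}")
    return sum(1 for accurate in image_labels if accurate)


def summarize(sets: list[list[bool]]) -> dict:
    """Summarize consensus labels for a collection of four-image sets."""
    scores = [set_score(s) for s in sets]
    dist = Counter(scores)
    return {
        # how many sets scored 0, 1, 2, 3, and 4 accurate images
        "distribution": {k: dist.get(k, 0) for k in range(IMAGES_PER_SET + 1)},
        "all_correct": dist.get(IMAGES_PER_SET, 0),
        "all_incorrect": dist.get(0, 0),
        # fraction of individual images judged accurate
        "image_accuracy": sum(scores) / (len(scores) * IMAGES_PER_SET),
    }


# Hypothetical labels for illustration only; these are NOT data from the study.
example_sets = [
    [True, True, True, True],      # an all-correct set
    [False, False, False, False],  # an all-incorrect set
    [True, False, True, False],    # a mixed set
]

if __name__ == "__main__":
    print(total_images(5, 30, IMAGES_PER_SET))
    print(summarize(example_sets))
```

Reporting all-correct and all-incorrect counts separately from image-level accuracy matters because two platforms with similar mean accuracy can still differ in how often a whole set fails, which is the contrast the abstract draws between platforms.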