Architecture inside the mirage: evaluating generative image models on architectural style, elements, and typologies

Generative artificial intelligence (GenAI) text-to-image systems are increasingly used to generate architectural imagery, yet their capacity to produce accurate images in a historically rule-bound field remains poorly characterized. We evaluated five widely used GenAI image platforms (Adobe Firefly, DALL-E 3, Google Imagen 3, Microsoft Image Generator, and Midjourney) using 30 architectural prompts spanning styles, typologies, and codified elements. Each prompt-generator pair produced four images (5 platforms × 30 prompts × 4 images; n = 600 images total). Two architectural historians independently scored each image for accuracy against predefined criteria, resolving disagreements by consensus. Set-level performance was summarized as the number of accurate images (zero to four) in each four-image set. Images generated from prompts classed as Common were 2.7-fold more accurate than those from prompts classed as Rare (p < 0.05). Across platforms, overall accuracy was limited (highest accuracy score 52 percent; lowest 32 percent; mean 42 percent). The frequency of all-correct (4 out of 4) outcomes was similar across platforms. By contrast, the frequency of all-incorrect (0 out of 4) outcomes varied substantially, with Imagen 3 exhibiting the fewest such failures and Microsoft Image Generator the most. Qualitative review of the image dataset identified recurring error patterns, including over-embellishment, confusion between medieval styles and their later revivals, and misrepresentation of descriptive prompts (for example, egg-and-dart, banded column, pendentive). These findings support the need for visible labeling of GenAI synthetic content, provenance standards for future training datasets, and cautious educational use of GenAI architectural imagery.
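
The study design and set-level summary described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names, the data layout (a dictionary keyed by platform and prompt), and the toy scores are all hypothetical, and treating the reported accuracy as the per-image fraction of accurate images is an assumption, since the abstract does not state how the percentages were computed.

```python
from collections import Counter

# Platforms and design constants as stated in the abstract.
PLATFORMS = [
    "Adobe Firefly",
    "DALL-E 3",
    "Google Imagen 3",
    "Microsoft Image Generator",
    "Midjourney",
]
N_PROMPTS = 30
IMAGES_PER_SET = 4


def total_images(n_platforms, n_prompts, images_per_set):
    """Total images produced by a fully crossed platform x prompt design."""
    return n_platforms * n_prompts * images_per_set


def set_score(image_is_accurate):
    """Count the accurate images (0 to 4) in one four-image set."""
    if len(image_is_accurate) != IMAGES_PER_SET:
        raise ValueError("each set must contain exactly four scored images")
    return sum(bool(flag) for flag in image_is_accurate)


def summarize(sets):
    """Summarize consensus scores.

    `sets` is a hypothetical layout: {(platform, prompt_id): [bool, bool, bool, bool]}.
    """
    if not sets:
        raise ValueError("no scored sets supplied")
    scores = [set_score(flags) for flags in sets.values()]
    distribution = Counter(scores)
    return {
        # How many sets landed on each possible score from 0 to 4.
        "distribution": {k: distribution.get(k, 0) for k in range(IMAGES_PER_SET + 1)},
        "all_correct": distribution.get(IMAGES_PER_SET, 0),
        "all_incorrect": distribution.get(0, 0),
        # Assumption: accuracy = accurate images / all images.
        "image_accuracy": sum(scores) / (len(scores) * IMAGES_PER_SET),
    }


# Toy scores for illustration only; these are NOT results from the study.
toy_sets = {
    ("Platform A", 1): [True, True, True, True],
    ("Platform A", 2): [False, False, False, False],
    ("Platform A", 3): [True, False, True, False],
}

print(total_images(len(PLATFORMS), N_PROMPTS, IMAGES_PER_SET))  # 600
print(summarize(toy_sets))
```

With the toy input, one set is all-correct, one is all-incorrect, and 6 of 12 images are accurate, so the assumed per-image accuracy is 0.5. Keeping the all-correct and all-incorrect counts separate from the overall accuracy mirrors the abstract's observation that platforms can look alike on one measure while differing on another.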
