The ambiguity between generalization and memorization in text-to-image (TTI) diffusion models becomes pronounced when prompts invoke culturally shared visual references, a phenomenon we term multimodal iconicity. These are instances in which images and texts reflect established cultural associations, such as when a title recalls a familiar artwork or film scene. Such cases challenge existing approaches to evaluating memorization, because in them instance-level memorization and culturally grounded generalization are structurally intertwined. To address this challenge, we propose an evaluation framework that assesses a model's ability to remain culturally grounded without relying on visual replication. Specifically, we introduce the Cultural Reference Transformation (CRT) metric, which separates two dimensions of model behavior: Recognition (whether a model evokes a reference) and Realization (how it depicts that reference, through replication or reinterpretation). We evaluate five diffusion models on 767 Wikidata-derived cultural references, covering both still and moving imagery, and find differences in how they respond to multimodal iconicity: some show weaker recognition, while others rely more heavily on replication. To assess linguistic sensitivity, we conduct prompt perturbation experiments using synonym substitutions and literal image descriptions, and find that models often reproduce iconic visual structures even when textual cues are altered. Finally, we find that cultural reference recognition correlates not only with training data frequency but also with textual uniqueness, reference popularity, and creation date. Our findings show that the behavior of diffusion models in culturally iconic settings cannot be reduced to simple reproduction but depends on how references are recognized and realized, advancing evaluation beyond simple text-image matching toward richer contextual understanding.