Generative image models are known to reproduce training images in their outputs, which raises privacy concerns when they are used for medical image generation. We propose a calibrated per-sample metric for detecting memorization and duplication of training data. The metric extracts image features with an MRI foundation model, aggregates whitened nearest-neighbor similarities across multiple layers, and maps them to bounded \emph{Overfit/Novelty Index} (ONI) and \emph{Memorization Index} (MI) scores. Across three MRI datasets with controlled duplication percentages and typical image augmentations, our metric robustly detects duplication and provides more consistent metric values across datasets. At the sample level, it achieves near-perfect detection of duplicates.