Detecting near-duplicate and duplicate images is a critical concern in medical imaging. Medical datasets often contain similar or duplicate images from various sources, which can cause significant performance issues and evaluation biases, particularly in machine learning tasks where data leaks between training and testing subsets. In this paper, we present an approach for identifying near-duplicate and duplicate 3D medical images that leverages publicly available 2D computer vision embeddings. We assess the approach by comparing embeddings extracted from two state-of-the-art self-supervised pretrained models and two different vector index structures for similarity retrieval. We construct an experimental benchmark from the publicly available Medical Segmentation Decathlon dataset. The proposed method yields promising results for near-duplicate and duplicate image detection, achieving a mean sensitivity of 0.9645 and a mean specificity of 0.8559.
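To make the retrieval and evaluation steps concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that each 3D volume is represented by the mean of its per-slice 2D embeddings, that similarity is measured by cosine similarity, and that a query is flagged as a duplicate when its best match exceeds a fixed threshold. The aggregation rule, the threshold of 0.95, the brute-force search (standing in for a vector index), and the toy three-dimensional embeddings are all illustrative assumptions.

```python
import math


def mean_pool(slice_embeddings):
    """Average per-slice 2D embeddings into one vector per 3D volume.

    Assumption: mean pooling is used here only for illustration.
    """
    n = len(slice_embeddings)
    dim = len(slice_embeddings[0])
    return [sum(e[i] for e in slice_embeddings) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_duplicate(query, index, threshold=0.95):
    """Flag the query if its nearest indexed vector reaches the threshold.

    Brute-force search stands in for a vector index; the threshold is hypothetical.
    """
    best = max(cosine_similarity(query, v) for v in index)
    return best >= threshold


def sensitivity_specificity(predicted, actual):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    return tp / (tp + fn), tn / (tn + fp)


# Toy index: two volumes, each with two slice embeddings (made-up values).
index = [
    mean_pool([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    mean_pool([[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]),
]

# Toy queries with made-up ground-truth duplicate labels.
queries = [
    [0.88, 0.12, 0.01],  # close to the first indexed volume
    [0.0, 0.1, 0.9],     # unrelated
    [0.01, 0.9, 0.12],   # close to the second indexed volume
    [0.5, 0.5, 0.5],     # unrelated
]
actual = [True, False, True, False]

predicted = [is_duplicate(q, index) for q in queries]
sensitivity, specificity = sensitivity_specificity(predicted, actual)
print(predicted, sensitivity, specificity)
```

In practice, the brute-force loop would be replaced by an approximate or exact nearest-neighbor index, and the threshold would be tuned on a benchmark with known duplicates, since it directly trades sensitivity against specificity.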