Background: Generalizability of AI colonoscopy algorithms is important for wider adoption in clinical practice. However, current techniques for evaluating performance on unseen data require expensive and time-intensive labels. Methods: We use a "Masked Siamese Network" (MSN) to identify novel phenomena in unseen data and to predict polyp detector performance. MSN is trained to predict masked-out regions of polyp images, without any labels. We test whether MSN, trained only on data from Israel, can detect unseen techniques, narrow-band imaging (NBI) and chromoendoscopy (CE), in colonoscopies from Japan (354 videos, 128 hours). We also test MSN's ability to predict the performance of Computer Aided Detection (CADe) of polyps on colonoscopies from both countries, even though MSN is not trained on data from Japan. Results: Using the label-free Fréchet distance, MSN correctly identifies NBI and CE as less similar to Israel white-light data than Japan white-light data is (bootstrapped z-test, |z| > 496, p < 10^-8 for both). MSN detects NBI with 99% accuracy, identifies CE more accurately than our heuristic (90% vs 79% accuracy) despite being trained only on white-light data, and is the only method that is robust to noisy labels. MSN predicts CADe polyp detector performance on in-domain Israel and out-of-domain Japan colonoscopies (r = 0.79 and 0.37, respectively). With a few examples of Japan detector performance to train on, MSN's prediction of Japan performance improves (r = 0.56). Conclusion: Our technique can identify distribution shifts in clinical data and can predict CADe detector performance on unseen data, without labels. Our self-supervised approach can help detect when data seen in practice differ from the training data, for example when they come from a different hospital or have shifted meaningfully from the training distribution. MSN has potential for application to medical image domains beyond colonoscopy.
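The label-free Fréchet distance mentioned in the Results can be illustrated with a minimal sketch. The abstract does not say how the distance is computed, so this assumes the common formulation used for the Fréchet Inception Distance: fit a Gaussian to each set of embeddings and compare means and covariances. The function name `frechet_distance` and the synthetic arrays are hypothetical stand-ins for MSN embeddings, not the authors' code or data.

```python
import numpy as np


def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to two embedding sets.

    x, y: arrays of shape (n_samples, dim). Assumption: each set is
    summarized by its sample mean and covariance (as in FID).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))
    # tr((cov_x cov_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative for
    # positive semi-definite inputs (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


# Synthetic stand-ins for embeddings (NOT real colonoscopy data).
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(2000, 8))      # e.g. training-domain frames
mild_shift = rng.normal(0.1, 1.0, size=(2000, 8))     # slightly shifted domain
strong_shift = rng.normal(2.0, 1.5, size=(2000, 8))   # strongly shifted domain

d_mild = frechet_distance(reference, mild_shift)
d_strong = frechet_distance(reference, strong_shift)
print(f"mild shift: {d_mild:.3f}, strong shift: {d_strong:.3f}")
```

Under these assumptions a larger distance indicates embeddings that are less similar to the reference set, and no labels are needed to compute it, only the embeddings themselves.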