The unreasonable effectiveness of AI CADe polyp detectors to generalize to new countries

Joel Shor, Hiro-o Yamano, Daisuke Tsurumaru, Yotami Intrator, Hiroki Kayama, Joe Ledsam, Atsushi Hamabe, Koji Ando, Mitsuhiko Ota, Haruei Ogino, Hiroshi Nakase, Kaho Kobayashi, Eiji Oki, Roman Goldenberg, Ehud Rivlin, Ichiro Takemasa

$\textbf{Background and aims}$: Artificial Intelligence (AI) Computer-Aided Detection (CADe) is commonly used for polyp detection, but data seen in clinical settings can differ from the data used for model training. Few studies evaluate how well CADe detectors perform on colonoscopies from countries not seen during training, and none are able to evaluate performance without collecting expensive and time-intensive labels. $\textbf{Methods}$: We trained a CADe polyp detector on Israeli colonoscopy videos (5004 videos, 1106 hours) and evaluated it on Japanese videos (354 videos, 128 hours) by measuring the True Positive Rate (TPR) versus false alarms per minute (FAPM). We introduce a colonoscopy dissimilarity measure called "MAsked mediCal Embedding Distance" (MACE) to quantify differences between colonoscopies without labels. We evaluated CADe on all Japanese videos and on those with the highest MACE. $\textbf{Results}$: MACE correctly quantifies that narrow-band imaging (NBI) and chromoendoscopy (CE) frames are less similar to the Israeli data than Japanese white-light frames are (bootstrapped z-test, |z| > 690, p < $10^{-8}$ for both). Despite differences in the data, CADe performance on Japanese colonoscopies was non-inferior to performance on Israeli ones without additional training (TPR at 0.5 FAPM: 0.957 and 0.972 for Israel and Japan; TPR at 1.0 FAPM: 0.972 and 0.989 for Israel and Japan; superiority test t > 45.2, p < $10^{-8}$). Although the detector was not trained on NBI or CE, TPR on those subsets was non-inferior to TPR on the Japanese data overall (non-inferiority test t > 47.3, p < $10^{-8}$, $\delta$ = 1.5% for both). $\textbf{Conclusion}$: Differences that prevent CADe detectors from performing well in non-medical settings do not degrade the performance of our AI CADe polyp detector when applied to data from a new country. MACE can help medical AI models internationalize by identifying the most "dissimilar" data on which to evaluate models.
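To make the "TPR at a given FAPM" operating point concrete, the following is a minimal sketch of one way such a number could be computed. It is not the authors' evaluation code: the abstract does not say how detections are matched to polyps, so the function name `tpr_at_fapm`, the per-polyp score (assumed here to be the highest detector score on that polyp), and the list of false-alarm scores are all hypothetical inputs chosen for illustration.

```python
import numpy as np


def tpr_at_fapm(polyp_scores, false_alarm_scores, minutes, fapm_budget):
    """Return the TPR at the most permissive threshold that keeps
    false alarms per minute (FAPM) within `fapm_budget`.

    Assumptions (not taken from the paper):
      * polyp_scores: one score per ground-truth polyp, e.g. the highest
        detector confidence on that polyp; a polyp counts as detected when
        its score is strictly above the threshold.
      * false_alarm_scores: one score per detection that matched no polyp.
      * minutes: total video duration the false alarms were collected over.
    """
    polyp_scores = np.asarray(polyp_scores, dtype=float)
    # Sort false-alarm scores from highest to lowest.
    fa_desc = np.sort(np.asarray(false_alarm_scores, dtype=float))[::-1]

    # Largest number of false alarms the budget allows over the whole duration.
    max_false_alarms = int(np.floor(fapm_budget * minutes))

    if max_false_alarms >= len(fa_desc):
        # The budget tolerates every false alarm, so no threshold is needed.
        threshold = -np.inf
    else:
        # Reject the (max_false_alarms + 1)-th highest false alarm and
        # everything below it by requiring score > threshold.
        threshold = fa_desc[max_false_alarms]

    return float(np.mean(polyp_scores > threshold))


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not results from the study.
    polyps = [0.9, 0.8, 0.4, 0.2]
    false_alarms = [0.95, 0.5, 0.3, 0.1]
    for budget in (0.5, 1.0):
        print(budget, tpr_at_fapm(polyps, false_alarms, minutes=2, fapm_budget=budget))
```

Sweeping `fapm_budget` over a range of values traces out a TPR-versus-FAPM curve of the kind the abstract reports at 0.5 and 1.0 FAPM. Using a strict inequality means tied scores can only lower the realized false-alarm rate, so the budget is never exceeded.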
