The unreasonable effectiveness of AI CADe polyp detectors to generalize to new countries

Joel Shor, Hiro-o Yamano, Daisuke Tsurumaru, Yotami Intrator, Hiroki Kayama, Joe Ledsam, Atsushi Hamabe, Koji Ando, Mitsuhiko Ota, Haruei Ogino, Hiroshi Nakase, Kaho Kobayashi, Eiji Oki, Roman Goldenberg, Ehud Rivlin, Ichiro Takemasa

$\textbf{Background and aims}$: Artificial Intelligence (AI) Computer-Aided Detection (CADe) is commonly used for polyp detection, but data seen in clinical settings can differ from the data used for model training. Few studies evaluate how well CADe detectors perform on colonoscopies from countries not seen during training, and none can evaluate performance without collecting expensive and time-intensive labels. $\textbf{Methods}$: We trained a CADe polyp detector on Israeli colonoscopy videos (5004 videos, 1106 hours) and evaluated it on Japanese videos (354 videos, 128 hours) by measuring the True Positive Rate (TPR) versus false alarms per minute (FAPM). We introduce a colonoscopy dissimilarity measure called "MAsked mediCal Embedding Distance" (MACE) to quantify differences between colonoscopies without labels. We evaluated CADe on all Japanese videos and on those with the highest MACE. $\textbf{Results}$: MACE correctly quantifies that narrow-band imaging (NBI) and chromoendoscopy (CE) frames are less similar to the Israeli data than Japanese white-light frames are (bootstrapped z-test, |z| > 690, p < $10^{-8}$ for both). Despite differences in the data, CADe performance on Japanese colonoscopies was non-inferior to performance on Israeli ones without additional training (TPR at 0.5 FAPM: 0.957 and 0.972 for Israel and Japan; TPR at 1.0 FAPM: 0.972 and 0.989 for Israel and Japan; superiority test t > 45.2, p < $10^{-8}$). Although the detector was not trained on NBI or CE, TPR on those subsets was non-inferior to TPR on the Japanese data overall (non-inferiority test t > 47.3, p < $10^{-8}$, $\delta$ = 1.5% for both). $\textbf{Conclusion}$: Differences that prevent CADe detectors from performing well in non-medical settings do not degrade the performance of our AI CADe polyp detector when applied to data from a new country. MACE can help medical AI models internationalize by identifying the most "dissimilar" data on which to evaluate models.
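The non-inferiority comparison with margin $\delta$ can be illustrated with a minimal sketch. It assumes each polyp contributes one independent detected/missed outcome and uses a normal approximation for the difference of two proportions; the abstract does not specify the exact test statistic, so this is not necessarily the authors' procedure. The function name `noninferiority_z` and the polyp counts in the usage lines are hypothetical, chosen only so that the rates match the reported TPRs at 0.5 FAPM.

```python
import math


def noninferiority_z(hits_ref, n_ref, hits_new, n_new, delta=0.015):
    """One-sided non-inferiority z-test for two detection rates (illustrative).

    H0: tpr_new <= tpr_ref - delta  (the new site is worse by at least delta)
    H1: tpr_new >  tpr_ref - delta  (the new site is non-inferior)
    """
    tpr_ref = hits_ref / n_ref
    tpr_new = hits_new / n_new
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(
        tpr_ref * (1 - tpr_ref) / n_ref + tpr_new * (1 - tpr_new) / n_new
    )
    # Shift the observed difference by the margin before standardizing.
    z = (tpr_new - tpr_ref + delta) / se
    # One-sided p-value: upper tail of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p


# Hypothetical counts (not from the paper): 1000 polyps per country.
z, p = noninferiority_z(hits_ref=957, n_ref=1000, hits_new=972, n_new=1000)
print(f"z = {z:.2f}, one-sided p = {p:.2g}")
```

A large positive statistic rejects the hypothesis that the new country's TPR is worse than the reference by more than $\delta$. Setting `delta=0` turns the same computation into a one-sided superiority test.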
