Generalization in medical AI: a perspective on developing scalable models

Over the past few years, research has witnessed the advancement of deep learning models trained on large datasets, some even encompassing millions of examples. While these models achieve impressive performance on their hidden test sets, they often underperform when assessed on external datasets. Recognizing the critical role of generalization in medical AI development, many prestigious journals now require results to be reported on both the local hidden test set and external datasets before considering a study for publication. Effectively, the field of medical AI has transitioned from the traditional use of a single dataset split into train and test sets to a more comprehensive framework using multiple datasets, some of which are used for model development (source domain) and others for testing (target domains). However, this new experimental setting does not necessarily resolve the challenge of generalization. The variability in intended use and the specificities of hospital cultures make the idea of universally generalizable systems a myth. On the other hand, systematic, and a fortiori recurrent, re-calibration of models at the individual hospital level, although ideal, may be overoptimistic given the legal, regulatory and technical challenges involved. Re-calibration using transfer learning may not even be possible in instances where reference labels for the target domain are not available. In this perspective, we establish a hierarchical three-level scale system reflecting the generalization level of a medical AI algorithm. This scale better reflects the diversity of real-world medical scenarios, in which target domain data for re-calibrating models may or may not be available and, if available, may or may not systematically come with reference labels.
