Data Set Terminology of Deep Learning in Medicine: A Historical Review and Recommendation

Medicine and deep learning-based artificial intelligence (AI) engineering are two distinct fields, each with decades of published history. Over that history, each field has developed its own terminology and its own conventions for applying it. When two fields with overlapping terminology begin to collaborate, miscommunication and misunderstanding can occur. This narrative review gives historical context for these terms, emphasizes the importance of clarity when they are used in medical AI contexts, and offers solutions to mitigate misunderstandings by readers from either field. Through an examination of historical documents, including articles, writing guidelines, and textbooks, this review traces the divergent evolution of terms for data sets and the impact of that divergence. First, the discordant interpretations of the word 'validation' in medical and AI contexts are explored. Then the data sets used for AI evaluation are classified into random splitting, cross-validation, temporal, geographic, internal, and external sets. Accurate and standardized description of these data sets is crucial for demonstrating the robustness and generalizability of AI applications in medicine. This review clarifies the existing literature to provide a comprehensive understanding of these classifications and their implications for AI evaluation. It then identifies terms that are often misunderstood and proposes pragmatic solutions to mitigate terminological confusion. Among these solutions are the use of standardized terminology such as 'training set,' 'validation (or tuning) set,' and 'test set,' and the explicit definition of data set splitting terminology in each medical AI research publication. This review aims to improve the precision of communication in medical AI, thereby fostering more effective and transparent research methodologies in this interdisciplinary field.
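As an illustration of the splitting terminology named above, the following is a minimal Python sketch, not a method taken from the review itself. It assumes each record is a dictionary with hypothetical `id` and `year` fields, and the function names `split_three_way` and `split_temporal` are invented for this example. It shows a random three-way split labelled with the recommended terms, and a temporal hold-out in which later records form the test set.

```python
import random


def split_three_way(records, tune_fraction=0.15, test_fraction=0.15, seed=0):
    """Randomly partition records into training, tuning (validation), and test sets.

    The fractions and the seed are illustrative assumptions, not recommended values.
    """
    if tune_fraction < 0 or test_fraction < 0 or tune_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility

    n_test = round(len(shuffled) * test_fraction)
    n_tune = round(len(shuffled) * tune_fraction)

    return {
        "test": shuffled[:n_test],                   # used once, for final evaluation
        "tuning": shuffled[n_test:n_test + n_tune],  # the AI sense of "validation"
        "training": shuffled[n_test + n_tune:],      # used to fit model parameters
    }


def split_temporal(records, cutoff_year):
    """Hold out records from cutoff_year onward as a temporal test set."""
    development = [r for r in records if r["year"] < cutoff_year]
    temporal_test = [r for r in records if r["year"] >= cutoff_year]
    return development, temporal_test


if __name__ == "__main__":
    # Synthetic records for demonstration only.
    demo = [{"id": i, "year": 2015 + i % 8} for i in range(20)]
    parts = split_three_way(demo, tune_fraction=0.25, test_fraction=0.25)
    for name, subset in parts.items():
        print(name, len(subset))
```

Naming the dictionary keys after the set roles keeps the role of each subset explicit wherever it is used, which is the kind of explicit definition the review recommends for publications.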
