Medical imaging (MI) datasets are fundamental to artificial intelligence in healthcare. The accuracy, robustness, and fairness of diagnostic algorithms depend on the data (and its quality) used to train and evaluate the models. MI datasets used to be proprietary but have become increasingly available to the public, including on community-contributed platforms (CCPs) such as Kaggle or HuggingFace. While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality and the recommended practices needed for sharing, documenting, and evaluating datasets. In this paper, we analyze publicly available machine learning datasets on CCPs, discuss the datasets' context, and identify limitations and gaps in the current CCP landscape. We highlight differences between MI and computer vision datasets, particularly the potentially harmful downstream effects of poor adoption of recommended dataset management practices. We compare the analyzed datasets across several dimensions, including data sharing, data documentation, and maintenance. We find vague licenses, a lack of persistent identifiers and storage, duplicates, and missing metadata, with differences between the platforms. Our research contributes to efforts in responsible data curation and AI algorithms for healthcare.