Copycats: the many lives of a publicly available medical imaging dataset

Medical imaging (MI) datasets are fundamental to artificial intelligence in healthcare. The accuracy, robustness, and fairness of diagnostic algorithms depend on the data (and its quality) used to train and evaluate the models. MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality and the recommended practices needed for sharing, documenting, and evaluating datasets. In this paper, we conduct an analysis of publicly available machine learning datasets on CCPs, discussing datasets' context, and identifying limitations and gaps in the current CCP landscape. We highlight differences between MI and computer vision datasets, particularly in the potentially harmful downstream effects of poor adoption of recommended dataset management practices. We compare the analyzed datasets across several dimensions, including data sharing, data documentation, and maintenance. We find vague licenses, lack of persistent identifiers and storage, duplicates, and missing metadata, with differences between the platforms. Our research contributes to efforts in responsible data curation and AI algorithms for healthcare.

