Augmenting Chest X-ray Datasets with Non-Expert Annotations

Advancing machine learning algorithms for medical image analysis requires larger training datasets. A popular and cost-effective way to obtain them is to extract annotations automatically from free-text medical reports, mainly because having expert clinicians annotate medical images such as chest X-rays is expensive. However, the resulting datasets have been shown to be susceptible to biases and shortcuts. Another strategy for increasing dataset size is crowdsourcing, a widely adopted practice in general computer vision that has had some success in medical image analysis. In a similar vein, we enhance two publicly available chest X-ray datasets with non-expert annotations. Instead of diagnostic labels, however, we annotate shortcuts in the form of tubes. We collect 3.5k chest drain annotations for NIH-CXR14 and 1k annotations for four different tube types in PadChest, and create the Non-Expert Annotations of Tubes in X-rays (NEATX) dataset. We train a chest drain detector on the non-expert annotations and find that it generalizes well to expert labels. Moreover, we compare our annotations to those provided by experts and show "moderate" to "almost perfect" agreement. Finally, we present a pathology agreement study to raise awareness about the quality of ground truth annotations. We make our dataset available at https://zenodo.org/records/14944064 and our code available at https://github.com/purrlab/chestxr-label-reliability.
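
The quoted terms "moderate" and "almost perfect" match the Landis and Koch interpretation bands commonly used with Cohen's kappa. The abstract does not name the agreement statistic, so the following is only a minimal sketch under the assumption that agreement is measured with Cohen's kappa on per-image labels. The function names and the toy labels are hypothetical and are not taken from the NEATX code or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    classes = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch (1977) verbal band."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy labels (made up for illustration): 1 = chest drain present, 0 = absent.
non_expert = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
expert = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]

kappa = cohen_kappa(non_expert, expert)
print(f"kappa = {kappa:.3f} ({landis_koch(kappa)})")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label, such as "no tube", dominates the dataset.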

