CheXmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images

The development of successful artificial intelligence models for chest X-ray analysis relies on large, diverse datasets with high-quality annotations. While several databases of chest X-ray images have been released, most include disease diagnosis labels but lack detailed pixel-level anatomical segmentation labels. To address this gap, we introduce an extensive multi-center chest X-ray segmentation dataset with uniform, fine-grained anatomical annotations for images from six well-known publicly available databases: CANDID-PTX, ChestX-ray8, CheXpert, MIMIC-CXR-JPG, PadChest, and VinDr-CXR, resulting in 676,803 segmentation masks. Our methodology uses the HybridGNet model to ensure consistent, high-quality segmentations across all datasets. The resulting masks were rigorously validated through expert physician evaluation and automatic quality control. Additionally, we provide an individual quality index for each mask and an overall quality estimate for each dataset. This dataset serves as a valuable resource for the broader scientific community, streamlining the development and assessment of innovative methodologies in chest X-ray analysis. The CheXmask dataset is publicly available at: https://physionet.org/content/chexmask-cxr-segmentation-data/
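As a rough illustration of how per-mask quality indices and per-dataset quality estimates might be used together, the sketch below filters masks by a quality threshold and averages the indices within each source database. The record layout, the image identifiers, the quality values, the 0.7 threshold, and the helper names `filter_by_quality` and `dataset_quality` are all assumptions made for this example; none of them come from the CheXmask release, whose actual file format should be checked on the dataset page.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (source database, image id, per-mask quality index).
# These values are invented for illustration and are NOT taken from CheXmask.
records = [
    ("VinDr-CXR", "img_001", 0.93),
    ("VinDr-CXR", "img_002", 0.61),
    ("PadChest", "img_003", 0.88),
    ("PadChest", "img_004", 0.79),
]


def filter_by_quality(rows, threshold):
    """Keep only masks whose quality index is at or above the threshold."""
    return [row for row in rows if row[2] >= threshold]


def dataset_quality(rows):
    """Average the per-mask quality indices within each source database."""
    grouped = defaultdict(list)
    for dataset, _image_id, quality in rows:
        grouped[dataset].append(quality)
    return {name: mean(values) for name, values in grouped.items()}


# The threshold is an arbitrary choice for this sketch.
kept = filter_by_quality(records, threshold=0.7)
summary = dataset_quality(records)

print(f"kept {len(kept)} of {len(records)} masks")
for name, score in summary.items():
    print(f"{name}: mean quality {score:.3f}")
```

A simple mean is only one possible aggregate; a median or the fraction of masks above a threshold could serve equally well as a per-dataset summary.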
