CheXmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images

The development of successful artificial intelligence models for chest X-ray analysis relies on large, diverse datasets with high-quality annotations. While several databases of chest X-ray images have been released, most include disease diagnosis labels but lack detailed pixel-level anatomical segmentation labels. To address this gap, we introduce an extensive multi-center chest X-ray segmentation dataset with uniform and fine-grained anatomical annotations for images from six well-known publicly available databases: CANDID-PTX, ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest, and VinDr-CXR, resulting in 676,803 segmentation masks. Our methodology uses the HybridGNet model to ensure consistent, high-quality segmentations across all datasets. The resulting masks were validated through expert physician evaluation and automatic quality control. Additionally, we provide a quality index for each mask and an overall quality estimate for each dataset. This dataset serves as a valuable resource for the broader scientific community, streamlining the development and assessment of innovative methodologies in chest X-ray analysis. The CheXmask dataset is publicly available at: https://physionet.org/content/chexmask-cxr-segmentation-data/

