The development of successful artificial intelligence models for chest X-ray analysis relies on large, diverse datasets with high-quality annotations. While several chest X-ray databases have been released, most include disease diagnosis labels but lack detailed pixel-level anatomical segmentation labels. To address this gap, we introduce CheXmask, an extensive multi-center chest X-ray segmentation dataset with uniform, fine-grained anatomical annotations for images from six well-known publicly available databases: CANDID-PTX, ChestX-ray8, CheXpert, MIMIC-CXR-JPG, PadChest, and VinDr-CXR, resulting in 676,803 segmentation masks. Our methodology uses the HybridGNet model to ensure consistent, high-quality segmentations across all datasets. The resulting masks were rigorously validated through expert physician evaluation and automatic quality control. Additionally, we provide an individual quality index for each mask and an overall quality estimate for each dataset. This dataset serves as a valuable resource for the broader scientific community, streamlining the development and assessment of innovative methodologies in chest X-ray analysis. The CheXmask dataset is publicly available at: https://physionet.org/content/chexmask-cxr-segmentation-data/