The development of successful artificial intelligence models for chest X-ray analysis relies on large, diverse datasets with high-quality annotations. While several databases of chest X-ray images have been released, most include disease diagnosis labels but lack detailed pixel-level anatomical segmentation labels. To address this gap, we introduce an extensive multi-center chest X-ray segmentation dataset with uniform, fine-grained anatomical annotations for images from six well-known publicly available databases: CANDID-PTX, ChestX-ray8, CheXpert, MIMIC-CXR-JPG, PadChest, and VinDr-CXR, resulting in 676,803 segmentation masks. Our methodology uses the HybridGNet model to ensure consistent, high-quality segmentations across all datasets. The resulting masks were rigorously validated through expert physician evaluation and automatic quality control. Additionally, we provide an individual quality index for each mask and an overall quality estimate for each dataset. This dataset serves as a valuable resource for the broader scientific community, streamlining the development and assessment of innovative methodologies in chest X-ray analysis. The CheXmask dataset is publicly available at: \url{https://physionet.org/content/chexmask-cxr-segmentation-data/}.