Polyps in the colon are widely known cancer precursors identified by colonoscopy. Whilst most polyps are benign, their number, size and surface structure are linked to the risk of colon cancer. Several methods have been developed to automate polyp detection and segmentation. However, these methods have not been tested rigorously on a large, multicentre, purpose-built dataset, partly because no comprehensive public dataset exists. As a result, the developed methods may not generalise to data from different populations. To this end, we have curated a dataset from six unique centres incorporating more than 300 patients. The dataset includes both single-frame and sequence data, with 3762 annotated polyp labels whose precisely delineated boundaries were verified by six senior gastroenterologists. To our knowledge, this is the most comprehensive detection and pixel-level segmentation dataset (referred to as \textit{PolypGen}) curated by a team of computational scientists and expert gastroenterologists. The paper provides insight into data construction and annotation strategies, quality assurance, and technical validation. Our dataset can be downloaded from \url{https://doi.org/10.7303/syn26376615}.