A multi-centre polyp detection and segmentation dataset for generalisability assessment

Sharib Ali,Debesh Jha,Noha Ghatwary,Stefano Realdon,Renato Cannizzaro,Osama E. Salem,Dominique Lamarque,Christian Daul,Michael A. Riegler,Kim V. Anonsen,Andreas Petlund,Pål Halvorsen,Jens Rittscher,Thomas de Lange,James E. East

from arxiv, 19 pages

Polyps in the colon are widely known cancer precursors identified by colonoscopy. Whilst most polyps are benign, their number, size and surface structure are linked to the risk of colon cancer. Several methods have been developed to automate polyp detection and segmentation. However, the main issue is that these methods have not been tested rigorously on a large, purpose-built, multicentre dataset, one reason being the lack of a comprehensive public dataset. As a result, the developed methods may not generalise to different population datasets. To this end, we have curated a dataset from six unique centres incorporating more than 300 patients. The dataset includes both single-frame and sequence data, with 3762 annotated polyp labels and precise delineation of polyp boundaries verified by six senior gastroenterologists. To our knowledge, this is the most comprehensive detection and pixel-level segmentation dataset (referred to as PolypGen) curated by a team of computational scientists and expert gastroenterologists. The paper provides insight into data construction and annotation strategies, quality assurance, and technical validation. Our dataset can be downloaded from https://doi.org/10.7303/syn26376615.

