The NCI Imaging Data Commons as a platform for reproducible research in computational pathology

Daniela P. Schacherer, Markus D. Herrmann, David A. Clunie, Henning Höfener, William Clifford, William J. R. Longabaugh, Steve Pieper, Ron Kikinis, Andrey Fedorov, André Homeyer

Objective: Reproducibility is critical for translating machine learning (ML)-based solutions in computational pathology (CompPath) into practice. However, an increasing number of studies report difficulties in reproducing ML results. The NCI Imaging Data Commons (IDC) is a public repository of >120 cancer image collections, including >38,000 whole-slide images (WSIs), that is designed to be used with cloud-based ML services. Here, we explore the potential of the IDC to facilitate reproducibility of CompPath research.

Materials and Methods: The IDC realizes the FAIR principles: all images are encoded according to the DICOM standard, persistently identified, discoverable via rich metadata, and accessible via open tools. Taking advantage of this, we implemented two experiments in which a representative ML-based method for classifying lung tumor tissue was trained and/or evaluated on different datasets from the IDC. To assess reproducibility, the experiments were run multiple times with independent but identically configured sessions of common ML services.

Results: The AUC values of different runs of the same experiment were generally consistent and in the same order of magnitude as those of a similar, previously published study. However, there were occasional small variations in AUC values of up to 0.044, indicating a practical limit to reproducibility.

Discussion and conclusion: By realizing the FAIR principles, the IDC enables other researchers to reuse exactly the same datasets. Cloud-based ML services enable others to run CompPath experiments in an identically configured computing environment without having to own high-performance hardware. The combination of both makes it possible to approach the reproducibility limit.
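The reproducibility check described above, comparing AUC values across repeated runs of the same experiment, can be illustrated with a minimal sketch. This is not the authors' code: the function names `roc_auc`, `auc_spread`, and `simulate_run` are hypothetical, and the scores are synthetic placeholders generated from a seeded random number generator, not data from the study. The AUC is computed with the standard pairwise (Mann-Whitney) formulation so that the example needs only the Python standard library.

```python
import random


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_spread(labels, runs):
    """Return the AUC of each run and the max-min difference across runs."""
    aucs = [roc_auc(labels, scores) for scores in runs]
    return aucs, max(aucs) - min(aucs)


def simulate_run(labels, seed):
    """Synthetic stand-in for one training/evaluation run (assumption, not study data)."""
    rng = random.Random(seed)
    return [rng.gauss(0.7 if y == 1 else 0.3, 0.2) for y in labels]


# Hypothetical test set: 50 tumor (1) and 50 non-tumor (0) tiles.
labels = [1] * 50 + [0] * 50
# Three "runs" that differ only in their random seed.
runs = [simulate_run(labels, seed) for seed in (0, 1, 2)]

aucs, spread = auc_spread(labels, runs)
for i, auc in enumerate(aucs):
    print(f"run {i}: AUC = {auc:.3f}")
print(f"max-min AUC difference across runs: {spread:.3f}")
```

In a real replication, each entry of `runs` would hold the prediction scores from one independent, identically configured session evaluated on the same test set, and the reported spread would correspond to the kind of run-to-run variation the abstract quantifies.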
