Histopathological images are widely used to analyze diseased (tumor) tissue and to select patient treatment. Microscopy images were previously processed mostly by hand by pathologists, but recent advances in computer vision allow deep learning-based solutions to recognize lesion regions accurately. Such models, however, usually require extensive annotated datasets for training, which are often unavailable for this task because the number of available patient data samples is very limited. To deal with this problem, we propose DeepCMorph, a novel model pre-trained to learn cell morphology and identify a large number of different cancer types. The model consists of two modules. The first performs cell nuclei segmentation and annotates each cell type; it is trained on a combination of 8 publicly available datasets to ensure high generalizability and robustness. The second combines the obtained segmentation map with the original microscopy image and is trained for the downstream task. We pre-trained this module on the Pan-Cancer TCGA dataset, which consists of over 270K tissue patches extracted from 8736 diagnostic slides from 7175 patients. The proposed solution achieved new state-of-the-art performance on this dataset, detecting 32 cancer types with over 82% accuracy and outperforming all previously proposed solutions by more than 4%. We demonstrate that the resulting pre-trained model can be easily fine-tuned on smaller microscopy datasets, yielding superior results compared to the current top solutions and to models initialized with ImageNet weights. The code and pre-trained models presented in this paper are available at: https://github.com/aiff22/DeepCMorph
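The abstract does not specify how the second module combines the segmentation map with the microscopy image. The sketch below illustrates one plausible reading, channel-wise concatenation, and should not be taken as the authors' implementation. The function name `build_classifier_input`, the 224×224 patch size, the six cell types, and the use of label 0 for background are all assumptions made for illustration.

```python
import numpy as np


def build_classifier_input(image, seg_map, num_cell_types):
    """Stack an RGB patch with one-hot cell-type masks along the channel axis.

    image:   (H, W, 3) uint8 microscopy patch
    seg_map: (H, W) integer labels, 0 = background, 1..num_cell_types = cell types
    """
    if image.shape[:2] != seg_map.shape:
        raise ValueError("image and segmentation map must share height and width")

    # One binary mask per cell type; background pixels are zero in every mask.
    labels = np.arange(1, num_cell_types + 1)
    masks = (seg_map[..., None] == labels).astype(np.float32)

    # Scale RGB intensities to [0, 1] so both inputs share a comparable range.
    rgb = image.astype(np.float32) / 255.0

    # Channel-wise concatenation: (H, W, 3 + num_cell_types).
    return np.concatenate([rgb, masks], axis=-1)


# Synthetic stand-ins for a tissue patch and the first module's output.
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
seg = rng.integers(0, 7, size=(224, 224))  # hypothetical: 6 cell types + background
x = build_classifier_input(patch, seg, num_cell_types=6)
print(x.shape)
```

Under these assumptions the downstream classifier receives the raw image and the predicted cell morphology together, so its first layer would need to accept `3 + num_cell_types` input channels.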