Robust Semi-Supervised Learning for Histopathology Images through Self-Supervision Guided Out-of-Distribution Scoring

Semi-supervised learning (semi-SL) is a promising alternative to supervised learning for medical image analysis when high-quality annotations are difficult to obtain. However, semi-SL assumes that the underlying distribution of the unlabeled data matches that of the few labeled samples, an assumption that is often violated in practice, particularly in medical imaging. Out-of-distribution (OOD) samples in the unlabeled training pool of semi-SL are inevitable and can reduce the efficiency of the algorithm. Common preprocessing methods for filtering out outlier samples may not be suitable for medical images, which involve a wide range of anatomical structures and rare morphologies. In this paper, we propose a novel pipeline for addressing open-set semi-supervised learning challenges in digital histology images. Our pipeline uses self-supervised learning to efficiently estimate an OOD score for each unlabeled data point, which calibrates the knowledge needed by a subsequent semi-SL framework. The OOD score then modulates sample selection in the semi-SL stage, so that samples conforming to the distribution of the few labeled samples are exposed to the semi-SL framework more frequently. Our pipeline is compatible with any semi-SL framework, and we base our experiments on the popular MixMatch framework. We conduct extensive studies on two digital pathology datasets, the Kather colorectal histology dataset and a dataset derived from TCGA-BRCA whole slide images, and establish the effectiveness of our method through comparisons with popular semi-SL methods and frameworks.
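
The idea of letting an OOD score modulate how often an unlabeled sample reaches the semi-SL stage can be illustrated with a minimal sketch. The abstract does not say how scores are mapped to sampling frequencies, so the softmax over negated scores, the `temperature` parameter, and the function names `sampling_weights` and `sample_unlabeled_batch` are all assumptions made for illustration, not the authors' method. The sketch also assumes that a higher score means a more outlying sample.

```python
import math
import random


def sampling_weights(ood_scores, temperature=1.0):
    """Map OOD scores (assumed: higher = more outlying) to sampling probabilities.

    Hypothetical mapping: a softmax over the negated scores, so that
    in-distribution samples receive larger probabilities.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [-score / temperature for score in ood_scores]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sample_unlabeled_batch(ood_scores, batch_size, rng, temperature=1.0):
    """Draw indices of unlabeled samples, with replacement, favoring low OOD scores."""
    weights = sampling_weights(ood_scores, temperature)
    return rng.choices(range(len(weights)), weights=weights, k=batch_size)
```

A call such as `sample_unlabeled_batch(scores, 64, random.Random(0))` would return 64 indices into the unlabeled pool, which could then be handed to whichever semi-SL framework is in use. Under this assumed mapping, lowering `temperature` suppresses high-scoring samples more aggressively, while a large value approaches uniform sampling.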