Systematic comparison of semi-supervised and self-supervised learning for medical image classification

In many medical image classification problems, labeled data is scarce while unlabeled data is more readily available. Semi-supervised learning and self-supervised learning are two distinct research directions that can improve accuracy by learning from extra unlabeled data. Recent methods from both directions have reported significant gains on traditional benchmarks. Yet past benchmarks do not focus on medical tasks and rarely compare self- and semi-supervised methods on equal footing. Furthermore, past benchmarks often handle hyperparameter tuning suboptimally. First, they may not tune hyperparameters at all, leading to underfitting. Second, when tuning does occur, it often unrealistically uses a labeled validation set much larger than the training set. Both cases make previously published rankings of methods difficult to translate to practical settings. This study contributes a systematic evaluation of self- and semi-supervised methods with a unified experimental protocol intended to guide a practitioner with scarce overall labeled data and a limited compute budget. We answer two key questions: Can hyperparameter tuning be effective with realistic-sized validation sets? If so, when all methods are tuned well, which self- or semi-supervised methods reach the best accuracy? Our study compares 13 representative semi- and self-supervised methods to strong labeled-set-only baselines on 4 medical datasets. From more than 20,000 total GPU hours of computation, we provide best practices for resource-constrained, results-focused practitioners.
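
To make the tuning setting concrete, the following is a minimal sketch of hyperparameter search with a validation set no larger than the training set. It is an illustration under stated assumptions, not the protocol used in this study: the function names `split_labeled` and `random_search`, the 0.5 cap on the validation fraction, the use of random search, and the `evaluate` callback (which would train a model with a given configuration and return its validation accuracy) are all hypothetical.

```python
import random


def split_labeled(n_labeled, val_fraction=0.2, seed=0):
    """Split labeled example indices into (train, validation) lists.

    Assumption: the validation fraction is capped at 0.5 so the validation
    set is never larger than the training set.
    """
    if n_labeled < 2:
        raise ValueError("need at least 2 labeled examples")
    if not 0.0 < val_fraction <= 0.5:
        raise ValueError("val_fraction must be in (0, 0.5]")
    indices = list(range(n_labeled))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(n_labeled * val_fraction))
    return indices[n_val:], indices[:n_val]


def random_search(evaluate, search_space, n_trials=10, seed=0):
    """Try n_trials random configurations and keep the best one.

    evaluate: hypothetical callback, config dict -> validation accuracy.
    search_space: dict mapping hyperparameter name -> list of candidate values.
    n_trials stands in for a fixed compute budget.
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in search_space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

In this sketch every method under comparison would receive the same `n_trials` and the same small validation split, so that differences in final accuracy reflect the methods rather than unequal tuning effort.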
