Benchmarking Foundation Models for Mitotic Figure Classification

The performance of deep learning models is known to scale with data quantity and diversity. In pathology, as in many other medical imaging domains, labeled images for a specific task are often scarce. Self-supervised learning techniques have enabled the use of vast amounts of unlabeled data to train large-scale neural networks, i.e., foundation models. These models can address the limited-data problem by providing semantically rich feature vectors that generalize well to new tasks with minimal training effort, increasing model performance and robustness. In this work, we investigate the use of foundation models for mitotic figure classification. The mitotic count, which can be derived from this classification task, is an independent prognostic marker for specific tumors and part of certain tumor grading systems. In particular, we investigate the data scaling laws of multiple current foundation models and evaluate their robustness to unseen tumor domains. In addition to the commonly used linear probing paradigm, we also adapt the models using low-rank adaptation (LoRA) of their attention mechanisms. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers. Our results demonstrate that LoRA-adapted foundation models outperform those adapted with standard linear probing: with only 10% of the training data, they reach performance close to that obtained with 100% of the training data. Furthermore, LoRA adaptation of the most recent foundation models almost closes the out-of-domain performance gap when evaluated on unseen tumor domains. However, full fine-tuning of traditional architectures still yields competitive performance.
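
To make the difference between linear probing and LoRA concrete, the following is a minimal NumPy sketch of the general LoRA idea applied to a single attention projection. It is not the implementation used in this work. The class name `LoRALinear`, the rank, the scaling factor `alpha`, the embedding width, and the random stand-in weight are all illustrative assumptions. Linear probing trains only a classification head on frozen features, whereas LoRA additionally trains a low-rank update to the frozen projection weights.

```python
import numpy as np


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        # Trainable factors: A gets a small random init, B starts at zero,
        # so the adapted layer initially reproduces the frozen layer exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (n, d_in); the result has shape (n, d_out).
        frozen = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return frozen + self.scale * update

    def trainable_params(self):
        # Only A and B would be updated during adaptation.
        return self.A.size + self.B.size


d_model = 64  # hypothetical embedding width, chosen only for illustration
rng = np.random.default_rng(1)
w_query = rng.normal(size=(d_model, d_model))  # stand-in for a pretrained query projection
lora_query = LoRALinear(w_query, rank=4)

x = rng.normal(size=(5, d_model))  # five hypothetical token embeddings
out = lora_query(x)

full_params = w_query.size
lora_params = lora_query.trainable_params()
print(f"trainable: {lora_params} of {full_params} projection weights")
```

In this sketch the low-rank factors hold 512 trainable values against 4096 in the full projection matrix, which illustrates why this kind of adaptation needs far fewer trainable parameters than full fine-tuning.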
