The performance of deep learning models is known to scale with data quantity and diversity. In pathology, as in many other medical imaging domains, the availability of labeled images for a specific task is often limited. Self-supervised learning has enabled the training of large-scale neural networks, i.e., foundation models, on vast amounts of unlabeled data. These models mitigate the limited-data problem by providing semantically rich feature vectors that generalize well to new tasks with minimal training effort, improving model performance and robustness. In this work, we investigate the use of foundation models for mitotic figure classification. The mitotic count, which can be derived from this classification task, is an independent prognostic marker for specific tumors and part of certain tumor grading systems. In particular, we investigate the data scaling laws of multiple current foundation models and evaluate their robustness to unseen tumor domains. In addition to the commonly used linear probing paradigm, we also adapt the models using low-rank adaptation (LoRA) of their attention mechanisms. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers. Our results demonstrate that LoRA-adapted foundation models outperform those adapted with standard linear probing, reaching, with only 10% of the training data, performance close to that achieved with full data availability. Furthermore, LoRA adaptation of the most recent foundation models almost closes the out-of-domain performance gap when evaluated on unseen tumor domains. However, full fine-tuning of traditional architectures still yields competitive performance.
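The LoRA adaptation of attention layers described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a PyTorch model whose attention projections are `nn.Linear` modules, and the class name `LoRALinear`, the rank `r`, and the scaling `alpha` are illustrative choices. The frozen base weight is augmented with a trainable low-rank update, so only a small fraction of parameters is tuned.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A is (r x in) and B is (out x r).
    Only A and B receive gradients; the base weights stay frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained projection
        # A: small random init, B: zeros, so the wrapped layer starts
        # exactly equal to the frozen base layer.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


# Illustrative usage: replace the attention projections of a ViT block.
# The attribute names ("attn", "qkv") follow common ViT implementations
# and are an assumption, not specific to any particular foundation model.
def add_lora_to_attention(model: nn.Module, r: int = 8) -> None:
    for module in model.modules():
        attn = getattr(module, "attn", None)
        if attn is not None and isinstance(getattr(attn, "qkv", None), nn.Linear):
            attn.qkv = LoRALinear(attn.qkv, r=r)
```

Because `B` is initialized to zero, the adapted network reproduces the pretrained features exactly at the start of fine-tuning, which keeps early training stable while the low-rank update is learned.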