Deep learning models have been successfully used to extract clinically actionable insights from routinely available histology data. These models generally require annotations by clinicians, which are scarce and costly to generate. The emergence of self-supervised learning (SSL) methods removes this barrier, allowing large-scale analyses on non-annotated data. However, recent SSL approaches rely on increasingly large model architectures and datasets, which rapidly escalates data volumes, hardware requirements, and overall costs and restricts access to these resources to a few institutions. We therefore investigated the complexity of contrastive SSL in computational pathology in relation to classification performance using consumer-grade hardware. Specifically, we analyzed the effects of adaptations in data volume, architecture, and algorithms on downstream classification tasks, emphasizing their impact on computational resources. We trained breast cancer foundation models on a large public patient cohort and validated them on various downstream classification tasks in a weakly supervised manner on two external public patient cohorts. Our experiments demonstrate that we can improve downstream classification performance while reducing SSL training duration by 90%. In summary, we propose a set of adaptations that enable the use of SSL in computational pathology in resource-limited environments.