Deep learning models have been used successfully to extract clinically actionable insights from routinely available histology data. These models generally require annotations by clinicians, which are scarce and costly to produce. The emergence of self-supervised learning (SSL) methods removes this barrier, allowing large-scale analyses of non-annotated data. However, recent SSL approaches rely on increasingly large model architectures and datasets, causing data volumes, hardware requirements, and overall costs to escalate rapidly and limiting access to these resources to a few institutions. We therefore investigated the complexity of contrastive SSL in computational pathology in relation to classification performance on consumer-grade hardware. Specifically, we analyzed how adaptations in data volume, architecture, and algorithms affect downstream classification tasks, with emphasis on their impact on computational resources. We trained breast cancer foundation models on a large public patient cohort and validated them, in a weakly supervised manner, on various downstream classification tasks in two external public patient cohorts. Our experiments demonstrate that downstream classification performance can be improved while SSL training duration is reduced by 90%. In summary, we propose a set of adaptations that enable the use of SSL in computational pathology in resource-constrained environments.