Whole slide imaging is fundamental to biomedical microscopy and computational pathology. However, whole slide images (WSIs) present a complex computer vision challenge due to their gigapixel size, diverse histopathologic features, spatial heterogeneity, and limited or absent data annotations. Given these challenges, supervised training alone can yield suboptimal whole slide representations. Self-supervised representation learning can instead learn high-quality WSI visual features for downstream diagnostic tasks, such as cancer diagnosis or molecular genetic prediction. Here, we present a general self-supervised whole slide learning (S3L) framework for gigapixel-scale self-supervision of WSIs. S3L combines data transformation strategies from transformer-based vision and language modeling into a single unified framework to generate paired views for self-supervision. S3L leverages the inherent regional heterogeneity, histologic feature variability, and information redundancy within WSIs to learn high-quality whole slide representations. We benchmark S3L visual representations on two diagnostic tasks for two biomedical microscopy modalities. S3L significantly outperforms WSI baselines for cancer diagnosis and genetic mutation prediction. Additionally, S3L performs well with both in-domain and out-of-distribution patch encoders, demonstrating flexibility and generalizability.
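To make the paired-view idea concrete, the following is a minimal sketch of how two views of one slide might be generated from its patch tokens. The abstract does not specify the transformations, so the three used here (a random split, a spatial crop, and random token masking) are illustrative assumptions. All names (`Token`, `split`, `crop`, `mask`, `paired_views`) and the default fractions are hypothetical, and the embeddings are dummy values standing in for the output of a patch encoder.

```python
import random
from typing import List, Tuple

# Hypothetical patch token: (row, col) grid position plus an embedding
# assumed to come from some pretrained patch encoder.
Token = Tuple[int, int, List[float]]


def split(tokens: List[Token], rng: random.Random):
    """Randomly partition the tokens into two disjoint halves."""
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def crop(tokens: List[Token], rng: random.Random, frac: float = 0.7):
    """Keep tokens inside a random window spanning `frac` of each axis."""
    if not tokens:
        return tokens
    rows = [t[0] for t in tokens]
    cols = [t[1] for t in tokens]
    r0, r1, c0, c1 = min(rows), max(rows), min(cols), max(cols)
    h = max(1, int((r1 - r0 + 1) * frac))
    w = max(1, int((c1 - c0 + 1) * frac))
    top = rng.randint(r0, r1 - h + 1)
    left = rng.randint(c0, c1 - w + 1)
    kept = [t for t in tokens
            if top <= t[0] < top + h and left <= t[1] < left + w]
    return kept or tokens  # fall back to the input if the window is empty


def mask(tokens: List[Token], rng: random.Random, keep: float = 0.5):
    """Randomly keep about `keep` of the tokens (at least one)."""
    if not tokens:
        return tokens
    n = max(1, int(len(tokens) * keep))
    return rng.sample(tokens, n)


def paired_views(tokens: List[Token], seed=None):
    """Build two views of one slide: split, then crop and mask each half."""
    rng = random.Random(seed)
    a, b = split(tokens, rng)
    return mask(crop(a, rng), rng), mask(crop(b, rng), rng)


# Toy slide: an 8x8 grid of patches with dummy 2-d embeddings.
slide = [(r, c, [float(r), float(c)]) for r in range(8) for c in range(8)]
view_a, view_b = paired_views(slide, seed=0)
```

Because the split is disjoint, the two views never share a patch. A self-supervised objective applied to them would therefore have to rely on slide-level content rather than on matching identical patches, which is one plausible way to exploit the information redundancy mentioned above.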