Tissue phenotyping is a fundamental task in computational pathology (CPath): learning objective characterizations of histopathologic biomarkers in anatomic pathology. However, whole-slide imaging (WSI) poses a complex computer vision problem, because the very high image resolution of WSIs and the enormous diversity of morphological phenotypes preclude large-scale data annotation. Current efforts rely on pretrained image encoders, obtained either by transfer learning from natural image datasets or by self-supervised pretraining on publicly available histopathology datasets, but these encoders have not been extensively developed and evaluated across diverse tissue types at scale. We introduce UNI, a general-purpose self-supervised model for pathology, pretrained on over 100 million tissue patches from over 100,000 diagnostic haematoxylin and eosin-stained WSIs across 20 major tissue types, and evaluated on 33 representative clinical tasks in CPath of varying diagnostic difficulty. In addition to outperforming previous state-of-the-art models, UNI demonstrates new modeling capabilities in CPath, such as resolution-agnostic tissue classification, slide classification using few-shot class prototypes, and disease subtyping generalization in classifying up to 108 cancer types in the OncoTree code classification system. UNI advances unsupervised representation learning at scale in CPath in terms of both pretraining data and downstream evaluation, enabling data-efficient AI models that can generalize and transfer to a gamut of diagnostically challenging tasks and clinical workflows in anatomic pathology.
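To make the idea of few-shot class prototypes concrete, the following is a minimal sketch and not the UNI implementation. The abstract does not say how prototypes are constructed, so the sketch assumes one common choice: each prototype is the mean of the L2-normalized embeddings of a few labeled examples, and a query is assigned to the prototype with the highest cosine similarity. The function names, the class labels "tumor" and "stroma", and the toy 3-dimensional vectors are all hypothetical; in practice the embeddings would come from a pretrained encoder.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_prototypes(embeddings, labels):
    """Average the normalized support embeddings of each class (assumed scheme)."""
    grouped = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        grouped[label].append(l2_normalize(emb))
    prototypes = {}
    for label, members in grouped.items():
        # zip(*members) iterates over dimensions, giving a per-dimension mean.
        mean = [sum(dim) / len(members) for dim in zip(*members)]
        prototypes[label] = l2_normalize(mean)
    return prototypes


def classify(embedding, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    query = l2_normalize(embedding)

    def similarity(label):
        # Both vectors are unit length, so the dot product is the cosine.
        return sum(q * p for q, p in zip(query, prototypes[label]))

    return max(prototypes, key=similarity)


# Toy 3-dimensional stand-ins for encoder embeddings (hypothetical values).
support = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2], [0.1, 0.8, 0.3]]
support_labels = ["tumor", "tumor", "stroma", "stroma"]

prototypes = build_prototypes(support, support_labels)
prediction = classify([0.8, 0.3, 0.0], prototypes)
```

Because the prototypes are plain averages, adding a class needs only a handful of labeled examples and no gradient-based training, which is what makes this style of classifier attractive in a few-shot setting.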