Foundation models are trained on massive amounts of data to recognize complex patterns and can be adapted to a wide range of downstream tasks with minimal computational resources. Here, we develop HistoEncoder, a foundation model for prostate cancer digital pathology, by pre-training on 48 million prostate tissue tile images. We demonstrate that HistoEncoder features extracted from tile images with similar histological patterns map close together in the feature space. HistoEncoder outperforms models pre-trained on natural images, even without fine-tuning or with 1000 times less training data. We describe two use cases that leverage the capabilities of HistoEncoder by fine-tuning the model with limited data and computational resources. First, we show how HistoEncoder can be used to automatically annotate large-scale datasets with high accuracy. Second, we combine histomics with commonly used clinical nomograms, significantly improving survival models for prostate cancer-specific death. Foundation models such as HistoEncoder can allow organizations with limited resources to build effective clinical software tools without needing extensive datasets or significant computing resources.