Contrastive visual language pretraining has emerged as a powerful method for either training new language-aware image encoders or augmenting existing pretrained models with zero-shot visual recognition capabilities. However, existing works typically train on large datasets of image-text pairs and are designed for downstream tasks involving only small- to medium-sized images. Neither assumption holds in the emerging field of computational pathology, where publicly available paired image-text datasets are limited and a single image can span up to 100,000 x 100,000 pixels. In this paper we present MI-Zero, a simple and intuitive framework for unleashing the zero-shot transfer capabilities of contrastively aligned image and text models on gigapixel histopathology whole slide images, enabling multiple downstream diagnostic tasks to be carried out by pretrained encoders without requiring any additional labels. MI-Zero reformulates zero-shot transfer under the framework of multiple instance learning to overcome the computational challenge of inference on extremely large images. We used over 550k pathology reports and other available in-domain text corpora to pretrain our text encoder. By effectively leveraging strong pretrained encoders, our best model, pretrained on over 33k histopathology image-caption pairs, achieves an average median zero-shot accuracy of 70.2% across three different real-world cancer subtyping tasks. Our code is available at: https://github.com/mahmoodlab/MI-Zero.
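To make the multiple instance learning reformulation concrete, the following is a minimal sketch rather than the released implementation. It assumes that a slide has already been split into patches, that each patch and each class prompt has been embedded by the aligned image and text encoders, and that slide-level scores are obtained by averaging the top-k patch-to-prompt cosine similarities. The function names, the `top_k` parameter, and the toy embeddings are hypothetical and chosen only for illustration; plain Python lists are used to keep the example dependency-free.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_slide_prediction(patch_embeddings, class_embeddings, top_k=2):
    """Return (predicted class index, slide-level scores) for one slide.

    patch_embeddings: one image-encoder embedding per patch (instance).
    class_embeddings: one text-encoder embedding per class prompt.
    top_k: hypothetical pooling parameter (an assumption of this sketch);
           the k highest patch scores per class are averaged.
    """
    patches = [l2_normalize(p) for p in patch_embeddings]
    classes = [l2_normalize(c) for c in class_embeddings]

    slide_scores = []
    for c in classes:
        # Cosine similarity between every patch and this class prompt.
        sims = [sum(a * b for a, b in zip(p, c)) for p in patches]
        # Assumed aggregation: mean of the top-k patch-level scores.
        top = sorted(sims, reverse=True)[:top_k]
        slide_scores.append(sum(top) / len(top))

    predicted = max(range(len(slide_scores)), key=slide_scores.__getitem__)
    return predicted, slide_scores


# Toy 2-D embeddings standing in for real encoder outputs (illustrative only).
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
prompts = [[1.0, 0.0], [0.0, 1.0]]
predicted, scores = zero_shot_slide_prediction(patches, prompts, top_k=2)
```

Because each patch is scored independently and only the scores are pooled, the full gigapixel image never has to pass through the encoder at once, which is the computational point of treating the slide as a bag of instances.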