A whole slide image (WSI) is a high-resolution scanned tissue image that is extensively used in computer-assisted diagnosis (CAD). The extremely high resolution of WSIs and the limited availability of region-level annotations make it challenging to apply deep learning methods to WSI-based digital diagnosis. Recently, integrating multiple instance learning (MIL) with Transformers for WSI analysis has shown very promising results. However, designing effective Transformers for this weakly supervised, high-resolution image analysis remains an underexplored yet important problem. In this paper, we propose a Multi-level MIL (MMIL) scheme that introduces a hierarchical structure to MIL, enabling efficient handling of MIL tasks that involve a large number of instances. Based on MMIL, we instantiate MMIL-Transformer, an efficient Transformer model with windowed exact self-attention for large-scale MIL tasks. To validate its effectiveness, we conducted a set of experiments on WSI classification tasks, where MMIL-Transformer demonstrates superior performance compared to existing state-of-the-art methods, achieving 96.80% test AUC and 97.67% test accuracy on the CAMELYON16 dataset, and 99.04% test AUC and 94.37% test accuracy on the TCGA-NSCLC dataset. All code and pre-trained models are available at: https://github.com/hustvl/MMIL-Transformer
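To make the multi-level idea concrete, the following is a minimal sketch of two-level aggregation with attention restricted to windows. It is not the MMIL-Transformer implementation: the function names (`two_level_mil`, `attention_pairs`), the identity query/key/value projections, the mean pooling, and the fixed contiguous windows are all assumptions made for illustration, since the abstract does not specify these details. The sketch also counts pairwise attention scores to show why windowing helps when a bag contains many instances.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Exact scaled dot-product self-attention.
    # Assumption: identity projections (no learned Q/K/V weights).
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out

def mean_pool(tokens):
    # Assumption: mean pooling stands in for whatever aggregation the model uses.
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]

def two_level_mil(instances, window_size):
    # Level 1: attend only within each window, then pool to one token per window.
    windows = [instances[i:i + window_size]
               for i in range(0, len(instances), window_size)]
    window_tokens = [mean_pool(self_attention(w)) for w in windows]
    # Level 2: attend across the window tokens, then pool to a bag embedding.
    return mean_pool(self_attention(window_tokens))

def attention_pairs(n, window_size):
    # Returns (pairs for full attention, pairs for the two-level scheme).
    full, rem = divmod(n, window_size)
    sizes = [window_size] * full + ([rem] if rem else [])
    level1 = sum(s * s for s in sizes)
    level2 = len(sizes) ** 2
    return n * n, level1 + level2

if __name__ == "__main__":
    bag = [[0.1 * i, 1.0 - 0.1 * i] for i in range(8)]
    print(two_level_mil(bag, window_size=4))
    print(attention_pairs(10_000, 100))
```

Under these assumptions, a bag of 10,000 instances with windows of 100 needs 1,010,000 pairwise scores across both levels, compared with 100,000,000 for full self-attention over the whole bag.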