A whole slide image (WSI) is a high-resolution scanned tissue image that is widely used in computer-assisted diagnosis (CAD). The extremely high resolution of WSIs and the limited availability of region-level annotations make it challenging to apply deep learning methods to WSI-based digital diagnosis. Multiple instance learning (MIL) is a powerful tool for addressing the weak annotation problem, and the Transformer has shown great success in visual tasks. Combining the two should offer new insights for deep-learning-based image diagnosis. However, because of the limitations of single-level MIL and the constraints that the attention mechanism places on sequence length, directly applying the Transformer to WSI-based MIL tasks is not practical. To tackle this issue, we propose a Multi-level MIL with Transformer (MMIL-Transformer) approach. By introducing a hierarchical structure to MIL, this approach enables efficient handling of MIL tasks that involve a large number of instances. To validate its effectiveness, we conducted a set of experiments on the WSI classification task, where MMIL-Transformer demonstrates superior performance compared to existing state-of-the-art methods. Our proposed approach achieves a test AUC of 94.74% and a test accuracy of 93.41% on the CAMELYON16 dataset, and a test AUC of 99.04% and a test accuracy of 94.37% on the TCGA-NSCLC dataset. All code and pre-trained models are available at: https://github.com/hustvl/MMIL-Transformer
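To make the sequence-length constraint concrete, the following is a rough back-of-the-envelope sketch rather than the MMIL-Transformer implementation. It counts attention score-matrix entries for a single long sequence versus a hypothetical two-level scheme in which instances are split into fixed-size groups, self-attention runs inside each group, and a second self-attention pass runs over one summary token per group. The grouping rule, the function names, and the example sizes are all assumptions made for illustration; the count also ignores the extra summary token inside each group.

```python
import math


def single_level_attention_pairs(num_instances: int) -> int:
    """Score-matrix entries when every instance attends to every instance."""
    return num_instances * num_instances


def two_level_attention_pairs(num_instances: int, group_size: int) -> int:
    """Score-matrix entries for an assumed two-level grouping.

    Level 1: self-attention inside each group independently.
    Level 2: self-attention over one summary token per group.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    full_groups, remainder = divmod(num_instances, group_size)
    num_groups = math.ceil(num_instances / group_size)
    # Full groups cost group_size^2 each; a final partial group costs remainder^2.
    level1 = full_groups * group_size ** 2 + remainder ** 2
    level2 = num_groups ** 2
    return level1 + level2


if __name__ == "__main__":
    n = 10_000  # hypothetical number of patches in one slide
    g = 100     # hypothetical group size
    print("single-level:", single_level_attention_pairs(n))
    print("two-level:   ", two_level_attention_pairs(n, g))
```

Under these assumed sizes the two-level count is roughly two orders of magnitude smaller than the single-level one, which illustrates why a hierarchical structure can make attention tractable for bags with very many instances.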