Classifying gigapixel histopathology images with deep multiple instance learning (MIL) models has become a critical task in digital pathology and precision medicine. In this work, we propose a Transformer-based MIL approach that replaces the traditional learned attention mechanism with a regional, Vision Transformer-inspired self-attention mechanism. We present a method that fuses regional patch information to derive slide-level predictions, and we show how this regional aggregation can be stacked to process features hierarchically at different distance levels. To increase predictive accuracy, especially for datasets with small, local morphological features, we introduce a method that focuses image processing on high-attention regions during inference. Our approach significantly improves performance over the baseline on two histopathology datasets and points towards promising directions for further research.
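To make the idea of stacked regional aggregation concrete, the following is a minimal NumPy sketch, not the implementation evaluated in this work. It assumes patch embeddings arranged on a regular grid, square regions of side `r`, single-head attention with random (untrained) weights, and a mean token standing in for a learned class token. All function names (`region_attention`, `aggregate_regions`, `random_weights`) are hypothetical. The final ranking step is one plausible reading of "focusing on high-attention regions", assumed here purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over one region's tokens, (n, d) -> (d,).

    Assumption: a mean token is prepended in place of a learned class token.
    Its output row serves as the region embedding; its attention over the
    n input tokens is returned as a relevance score per token.
    """
    cls = tokens.mean(axis=0, keepdims=True)
    x = np.concatenate([cls, tokens], axis=0)        # (n + 1, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (n + 1, n + 1)
    out = attn @ v
    return out[0], attn[0, 1:]

def aggregate_regions(grid, r, weights):
    """Fuse each r x r block of an (H, W, d) feature grid into one token."""
    h, w, d = grid.shape
    assert h % r == 0 and w % r == 0, "sketch assumes the grid divides evenly"
    out = np.zeros((h // r, w // r, d))
    for i in range(h // r):
        for j in range(w // r):
            block = grid[i * r:(i + 1) * r, j * r:(j + 1) * r].reshape(-1, d)
            out[i, j], _ = region_attention(block, *weights)
    return out

rng = np.random.default_rng(0)
d = 8  # embedding size, chosen arbitrarily for the sketch

def random_weights():
    # Untrained projection matrices; a real model would learn these.
    return tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

patches = rng.normal(size=(8, 8, d))                       # stand-in patch embeddings
level1 = aggregate_regions(patches, 2, random_weights())   # (4, 4, d)
level2 = aggregate_regions(level1, 2, random_weights())    # (2, 2, d)

# Slide-level vector from the coarsest level, plus attention per top-level region.
slide_vec, region_attn = region_attention(level2.reshape(-1, d), *random_weights())

# Assumed focusing step: rank regions by attention and keep the top two,
# which could then be reprocessed in more detail at inference time.
top_regions = np.argsort(region_attn)[::-1][:2]
```

Each call to `aggregate_regions` shrinks the grid by a factor of `r` per side, so stacking two levels lets the second level attend across tokens that each summarize a 2 x 2 neighbourhood of patches. A classifier head on `slide_vec` would complete the slide-level prediction; it is omitted here.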