The classification of gigapixel histopathology images with deep multiple instance learning models has become a critical task in digital pathology and precision medicine. In this work, we propose a Transformer-based multiple instance learning approach that replaces the traditional learned attention mechanism with a regional, Vision Transformer-inspired self-attention mechanism. We present a method that fuses regional patch information to derive slide-level predictions and show how this regional aggregation can be stacked to process features hierarchically at different distance levels. To increase predictive accuracy, especially for datasets with small, local morphological features, we introduce a method that focuses image processing on high-attention regions during inference. Our approach significantly improves performance over the baseline on two histopathology datasets and points towards promising directions for further research.
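The stacked regional aggregation described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that are not stated in the abstract: patch features are arranged on a 2D grid, regions are non-overlapping k x k windows, self-attention uses identity query/key/value projections instead of learned ones, and each region is fused by mean pooling. The function names `self_attention`, `aggregate_regions`, and `slide_embedding` are hypothetical. The inference-time focus on high-attention regions is not covered here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of feature vectors.

    Assumption: identity Q/K/V projections (a real model would learn them).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[c] for w, t in zip(weights, tokens)) for c in range(d)])
    return out


def aggregate_regions(grid, k):
    """Fuse each non-overlapping k x k region of a feature grid into one vector.

    grid[i][j] is the feature vector of the patch at row i, column j.
    Returns a smaller grid with one fused vector per region.
    """
    rows, cols = len(grid), len(grid[0])
    fused = []
    for r in range(0, rows, k):
        fused_row = []
        for c in range(0, cols, k):
            tokens = [grid[i][j]
                      for i in range(r, min(r + k, rows))
                      for j in range(c, min(c + k, cols))]
            attended = self_attention(tokens)
            d = len(attended[0])
            # Assumption: mean pooling turns the attended tokens into one region feature.
            fused_row.append([sum(t[x] for t in attended) / len(attended) for x in range(d)])
        fused.append(fused_row)
    return fused


def slide_embedding(grid, k):
    """Stack regional aggregation until a single slide-level vector remains."""
    if k < 2:
        raise ValueError("k must be at least 2 so that the grid shrinks at every level")
    while len(grid) > 1 or len(grid[0]) > 1:
        grid = aggregate_regions(grid, k)
    return grid[0][0]


# Usage: a 4 x 4 grid of 2-dimensional patch features, fused with 2 x 2 regions.
features = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
level_one = aggregate_regions(features, 2)   # 2 x 2 grid of region features
embedding = slide_embedding(features, 2)     # single 2-dimensional slide vector
print(len(level_one), len(level_one[0]), len(embedding))
```

Each call to `aggregate_regions` widens the spatial context a single vector summarizes, which is one way to read "different distance levels": the first level mixes neighbouring patches, and later levels mix neighbouring regions. A classifier head on the final vector would give the slide-level prediction.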