Bag-based Multiple Instance Learning (MIL) approaches have emerged as the mainstream methodology for Whole Slide Image (WSI) classification. However, most existing methods adopt a two-stage training strategy: features are first extracted by a pre-trained feature extractor and then aggregated by a MIL network. Decoupling these stages leaves the feature extractor and the MIL network insufficiently co-optimized, preventing end-to-end joint training and thereby limiting the overall performance of the model. Additionally, conventional methods extract features from all patches at a single fixed size, ignoring the multi-scale manner in which pathologists examine slides. This not only wastes substantial computational resources when tumor regions occupy only a small fraction of the slide (as in the Camelyon16 dataset) but may also drive the model toward suboptimal solutions. To address these limitations, this paper proposes an end-to-end multi-scale WSI classification framework that integrates multi-scale feature extraction with multiple instance learning. Specifically, our approach comprises: (1) a semantic feature filtering module that reduces interference from non-lesion regions; (2) a multi-scale feature extraction module that captures pathological information at different levels; and (3) a multi-scale fusion MIL module for global modeling and feature integration. Through an end-to-end training strategy, we jointly optimize the feature extractor and the MIL network, ensuring maximum compatibility between them. Experiments on three cross-center datasets (DigestPath2019, BCNB, and UBC-OCEAN) demonstrate that the proposed method outperforms existing state-of-the-art approaches in both accuracy (ACC) and AUC.
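To make the bag-based MIL aggregation concrete, the sketch below shows a common instantiation of the idea: attention-weighted pooling of per-patch features into a single bag (slide) embedding. This is a generic NumPy illustration of MIL pooling, not the paper's specific multi-scale fusion MIL module; the weight matrices and dimensions are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_mil_pool(instances, w_proj, w_attn):
    """Generic attention-based MIL pooling (illustrative, not the paper's module).

    instances: (n, d) array of per-patch features for one bag (slide).
    w_proj:    (d, k) hypothetical projection into an attention space.
    w_attn:    (k,)   hypothetical attention scoring vector.
    Returns the (d,) bag embedding and the (n,) attention weights.
    """
    h = np.tanh(instances @ w_proj)   # (n, k) hidden representation
    scores = h @ w_attn               # (n,) one scalar score per patch
    a = softmax(scores)               # attention weights, sum to 1
    return a @ instances, a           # weighted sum over patches

rng = np.random.default_rng(0)
bag = rng.normal(size=(8, 16))        # 8 patch features of dimension 16
w_proj = rng.normal(size=(16, 4))
w_attn = rng.normal(size=(4,))
emb, weights = attention_mil_pool(bag, w_proj, w_attn)
print(emb.shape)          # (16,)
print(weights.sum())      # ~1.0
```

In a two-stage pipeline the patch features would come from a frozen pre-trained extractor; the end-to-end setting the abstract advocates instead backpropagates the bag-level loss through both the pooling weights and the extractor itself.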