In biomedical data analysis, Multiple Instance Learning (MIL) models have emerged as a powerful tool for classifying patients' microscopy samples. However, these models require large amounts of training data, which poses a significant challenge when data are scarce, e.g., in rare diseases. To mitigate this challenge, we introduce a topological regularization term for MIL. It provides a shape-preserving inductive bias that compels the encoder to maintain the essential geometrical-topological structure of input bags when projecting them into latent space. This improves the performance and generalization of the MIL classifier regardless of the aggregation function, particularly when training data are scarce. Experiments across a range of datasets confirm the effectiveness of our method, showing average improvements of 2.8% on MIL benchmarks, 15.3% on synthetic MIL datasets, and 5.5% on real-world biomedical datasets over the current state of the art.
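To make the idea of a shape-preserving regularizer concrete, the following is a minimal sketch, not the method evaluated above. The abstract does not specify the form of the regularization term, so the sketch assumes one common choice: matching the lengths of minimum-spanning-tree edges (which correspond to the 0-dimensional persistence pairs of a Vietoris-Rips filtration) between a bag's input space and its latent embedding. The function names `pairwise_distances`, `mst_edges`, and `topo_regularizer` are hypothetical, and the code only evaluates the penalty; an actual training setup would compute it in an automatic-differentiation framework and add it, with a weight, to the classification loss.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length vectors."""
    return [[math.dist(p, q) for q in points] for p in points]


def mst_edges(dist):
    """Edges of a minimum spanning tree (Prim's algorithm) on a dense distance matrix."""
    n = len(dist)
    if n < 2:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    best = list(dist[0])   # cheapest known distance from the tree to each vertex
    parent = [0] * n       # tree vertex that realizes that cheapest distance
    edges = []
    for _ in range(n - 1):
        # Pick the vertex outside the tree that is closest to it.
        j = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        edges.append((parent[j], j))
        in_tree[j] = True
        for k in range(n):
            if not in_tree[k] and dist[j][k] < best[k]:
                best[k] = dist[j][k]
                parent[k] = j
    return edges


def topo_regularizer(bag, latent):
    """Assumed penalty: squared mismatch of distances on the MST edges of each space.

    `bag` holds the input instances of one bag, `latent` their encodings,
    in the same order. Returns 0.0 when the selected distances agree.
    """
    dx = pairwise_distances(bag)
    dz = pairwise_distances(latent)
    loss = 0.0
    for i, j in mst_edges(dx):   # structure of the input bag
        loss += 0.5 * (dx[i][j] - dz[i][j]) ** 2
    for i, j in mst_edges(dz):   # structure of the latent bag
        loss += 0.5 * (dz[i][j] - dx[i][j]) ** 2
    return loss
```

Under this assumed form, an embedding that reproduces the bag's distances incurs no penalty, while one that stretches or collapses the bag is penalized in proportion to how much the selected distances change.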