Tabular biomedical data poses challenges for machine learning because it is often high-dimensional and typically low-sample-size. Previous research has attempted to address these challenges via feature selection approaches, which can lead to unstable performance on real-world data. This suggests that current methods lack appropriate inductive biases that capture patterns common to different samples. In this paper, we propose ProtoGate, a prototype-based neural model that introduces an inductive bias by attending to both homogeneity and heterogeneity across samples. ProtoGate selects features in a global-to-local manner and uses them to produce explainable predictions via a prototype-based model. We conduct comprehensive experiments to evaluate the performance of ProtoGate on synthetic and real-world datasets. Our results show that exploiting the homogeneous and heterogeneous patterns in the data can improve prediction accuracy, while the prototypes make the predictions interpretable.