We develop a mixed-integer optimization (MIO) approach to cluster-aware regression, i.e., linear regression that accounts for the inherent clustered structure of the data. We compare it with linear mixed-effects regression (LMEM), currently the most widely used method, and design simulation experiments in which our approach outperforms LMEM on both predictive and inferential metrics. Furthermore, our formulation is highly interpretable. LMEM cannot make cluster-informed predictions when the cluster of a new data point is unknown; we address this by training an interpretable classification tree that assigns cluster effects to new data points, and we demonstrate this generalizability on a real protein expression dataset.
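The prediction step for a data point whose cluster is unknown can be illustrated with a small sketch. This is not the MIO formulation itself: it assumes the training cluster labels are already given (in the method they would come from the optimization), it uses a single shared slope with one additive effect per cluster, and a depth-1 tree (a stump) on a hypothetical auxiliary feature `z` stands in for the classification tree. All function and variable names are illustrative assumptions.

```python
from collections import defaultdict

def fit_cluster_effects(x, y, cluster):
    """Least-squares fit of y = slope * x + effect[cluster] with a shared slope."""
    groups = defaultdict(list)
    for xi, yi, ci in zip(x, y, cluster):
        groups[ci].append((xi, yi))
    # Per-cluster means of x and y.
    means = {
        c: (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
        for c, pts in groups.items()
    }
    # Shared slope from within-cluster centered data.
    num = sum((xi - means[ci][0]) * (yi - means[ci][1])
              for xi, yi, ci in zip(x, y, cluster))
    den = sum((xi - means[ci][0]) ** 2 for xi, ci in zip(x, cluster))
    slope = num / den
    effects = {c: my - slope * mx for c, (mx, my) in means.items()}
    return slope, effects

def fit_stump(z, cluster):
    """Depth-1 tree on z for clusters labelled 0/1: pick the split with fewest errors."""
    values = sorted(set(z))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between neighbouring values
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if zi <= t else right) != ci
                         for zi, ci in zip(z, cluster))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return t, left, right

def predict(x_new, z_new, slope, effects, stump):
    """Assign a cluster with the stump, then add that cluster's effect."""
    t, left, right = stump
    c = left if z_new <= t else right
    return slope * x_new + effects[c]

# Toy, noise-free data: y = 2 * x + effect, with effect 1 in cluster 0 and 5 in cluster 1.
x = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
z = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cluster = [0, 0, 0, 0, 1, 1, 1, 1]
y = [2.0 * xi + (1.0 if ci == 0 else 5.0) for xi, ci in zip(x, cluster)]

slope, effects = fit_cluster_effects(x, y, cluster)
stump = fit_stump(z, cluster)
```

Calling `predict(4.0, 0.05, slope, effects, stump)` routes the new point to cluster 0 through the stump and applies that cluster's effect. Because the split rule is a single readable threshold, the cluster assignment for new data stays easy to inspect.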