Recent advances in Mixed Integer Optimization (MIO) algorithms, together with hardware improvements, have led to significant speedups in solving MIO problems. These techniques have been applied to optimal subset selection, that is, choosing $k$ features out of $p$ in linear regression given $n$ observations. In this paper, we extend this approach to cluster-aware regression, where the goal is to select $\lambda$ out of $K$ clusters in a linear mixed-effects model (LMM) with $n_k$ observations in cluster $k$. In extensive tests on synthetic and real datasets, we show that our method solves these problems within minutes. Numerical experiments also show that the MIO approach outperforms both Gaussian- and Laplace-distributed LMMs in producing sparse solutions with high predictive power. Traditional LMMs typically assume that cluster effects are independent of individual features. In contrast, we introduce an algorithm that evaluates cluster effects for new data points, thereby increasing the robustness and precision of the model. The inferential and predictive efficacy of the approach is further illustrated through applications to student scoring and protein expression.
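For orientation, the subset selection problem mentioned above is commonly written as a mixed integer program with binary indicators and a big-$M$ bound. The following is a minimal sketch of that standard formulation, not necessarily the exact model used in this paper; the symbols $y$ (response vector), $X$ (the $n \times p$ design matrix), $\beta$ (coefficients), $z_j$ (binary selection indicators), and $M$ (a sufficiently large constant) are notation assumed here for illustration.

$$
\min_{\beta \in \mathbb{R}^p,\; z \in \{0,1\}^p} \; \tfrac{1}{2}\lVert y - X\beta \rVert_2^2
\quad \text{s.t.} \quad -M z_j \le \beta_j \le M z_j \;\; (j = 1,\dots,p), \qquad \sum_{j=1}^{p} z_j \le k.
$$

Setting $z_j = 0$ forces $\beta_j = 0$, so at most $k$ coefficients can be nonzero. One plausible cluster-level analogue, stated purely as an assumption, attaches a random intercept $u_k$ to each cluster and allows at most $\lambda$ of the $K$ cluster effects to be nonzero; here $x_{ki}$ and $y_{ki}$ denote the features and response of observation $i$ in cluster $k$, again hypothetical notation:

$$
\min_{\beta,\; u \in \mathbb{R}^K,\; z \in \{0,1\}^K} \; \tfrac{1}{2}\sum_{k=1}^{K}\sum_{i=1}^{n_k} \bigl(y_{ki} - x_{ki}^{\top}\beta - u_k\bigr)^2
\quad \text{s.t.} \quad -M z_k \le u_k \le M z_k \;\; (k = 1,\dots,K), \qquad \sum_{k=1}^{K} z_k \le \lambda.
$$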