Sparse methods (e.g., Best Subset Selection, Elastic Net) are the standard approach for obtaining interpretable models, but they can suffer from high variance and are vulnerable to spurious correlations. In contrast, algorithmic ensembles (e.g., Random Forests, Gradient Boosting) achieve high prediction accuracy but yield uninterpretable black boxes driven by randomization or sequential residual fitting. In recent years, a unifying paradigm has emerged: Objective-Driven Ensembles. By generalizing best subset selection into a joint mathematical optimization problem, this approach builds interpretable ensembles by optimally splitting predictors across a small number of diverse models. In this paper, we synthesize this growing body of literature and illustrate the statistical principles behind its empirical success. Specifically, we use finite-sample bounds to show how penalizing predictor overlap controls ensemble covariance and provides a mathematical hedge against spurious correlations. We evaluate these mechanics using an exact combinatorial oracle and review how recent computational approximations have scaled this framework to a variety of settings, including high-dimensional data, classification tasks, and data with casewise or cellwise contamination, achieving machine-learning-level accuracy while retaining the interpretability of sparse models.
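To make the idea of splitting predictors across a small number of models concrete, the following is a minimal sketch of an exhaustive search, not the formulation or the oracle used in the paper. It rests on several assumptions: the joint objective is taken to be the sum of the models' residual sums of squares, the overlap penalty is replaced by a hard constraint that no predictor appears in more than one model, every model uses exactly `size` predictors, no intercept is fitted, and the ensemble prediction is the plain average of the model predictions. All function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def subset_rss(X, y, subset):
    """Least-squares fit on the given columns (no intercept); returns (RSS, coefficients)."""
    Xs = X[:, list(subset)]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef
    return float(resid @ resid), coef


def disjoint_ensembles(p, n_models, size):
    """Yield every way to choose n_models disjoint predictor subsets of the given size."""
    def extend(chosen, available):
        if len(chosen) == n_models:
            yield tuple(chosen)
            return
        for subset in combinations(available, size):
            # Order subsets by their smallest index so each ensemble is visited once.
            if chosen and subset[0] < chosen[-1][0]:
                continue
            remaining = [j for j in available if j not in subset]
            yield from extend(chosen + [subset], remaining)

    yield from extend([], list(range(p)))


def best_disjoint_ensemble(X, y, n_models, size):
    """Assumed objective: minimize the sum of per-model RSS over all disjoint ensembles."""
    best = None
    for ensemble in disjoint_ensembles(X.shape[1], n_models, size):
        fits = [subset_rss(X, y, s) for s in ensemble]
        total = sum(rss for rss, _ in fits)
        if best is None or total < best[0]:
            best = (total, ensemble, [coef for _, coef in fits])
    return best


def predict(X, ensemble, coefs):
    """Ensemble prediction as the plain average of the individual model predictions."""
    preds = [X[:, list(s)] @ c for s, c in zip(ensemble, coefs)]
    return np.mean(preds, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 6))
    y = X[:, 0] + X[:, 3] + 0.1 * rng.normal(size=50)
    total, ensemble, coefs = best_disjoint_ensemble(X, y, n_models=2, size=2)
    print("selected subsets:", ensemble)
    print("summed RSS:", total)
```

The number of candidate ensembles grows combinatorially with the number of predictors, so a search of this kind is only feasible for very small problems; this is the gap that the computational approximations mentioned in the abstract are meant to close.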