Ensembling is among the most popular tools in machine learning (ML) due to its effectiveness in reducing variance and thus improving generalization. Most ensembling methods for black-box base learners fall under the umbrella of "stacked generalization," namely training an ML algorithm that takes the predictions of the base learners as input. While stacking has been widely applied in practice, its theoretical properties are poorly understood. In this paper, we prove a novel result showing that choosing the best stacked generalization from a (finite or finite-dimensional) family of stacked generalizations based on cross-validated performance does not perform "much worse" than the oracle best. Our result strengthens and significantly extends the results in Van der Laan et al. (2007). Inspired by the theoretical analysis, we further propose a particular family of stacked generalizations in the context of probabilistic forecasting, each with a different sensitivity to how much the ensemble weights are allowed to vary across items, timestamps in the forecast horizon, and quantiles. Experimental results demonstrate the performance gain of the proposed method.
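To make the selection step concrete, the following is a minimal sketch of choosing one stacker from a finite family by cross-validated loss. It is an illustration under stated assumptions, not the family proposed in the paper: the candidates are linear stackers whose weights are shrunk toward the uniform average by a penalty `lam`, the loss is squared error rather than a quantile loss, and the names `fit_weights` and `cv_select` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_weights(P, y, lam):
    """Hypothetical stacker: least-squares ensemble weights shrunk toward
    the uniform average. Larger lam keeps the weights closer to 1/k."""
    k = P.shape[1]
    u = np.full(k, 1.0 / k)
    # Closed-form minimizer of ||y - P w||^2 + lam * ||w - u||^2.
    A = P.T @ P + lam * np.eye(k)
    b = P.T @ y + lam * u
    return np.linalg.solve(A, b)

def cv_select(P, y, lams, n_folds=5, seed=0):
    """Return the penalty with the lowest cross-validated squared error,
    together with the cross-validated loss of every candidate."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cv_loss = []
    for lam in lams:
        fold_losses = []
        for i, val in enumerate(folds):
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            w = fit_weights(P[train], y[train], lam)
            fold_losses.append(np.mean((P[val] @ w - y[val]) ** 2))
        cv_loss.append(np.mean(fold_losses))
    cv_loss = np.array(cv_loss)
    return lams[int(np.argmin(cv_loss))], cv_loss

# Synthetic example (assumption): three base learners that observe the
# target with different noise levels. Columns of P are their predictions.
rng = np.random.default_rng(0)
n = 200
y = rng.normal(size=n)
P = np.column_stack([y + rng.normal(scale=s, size=n) for s in (0.3, 0.6, 1.0)])

lams = [0.01, 1.0, 100.0, 1e4]        # finite family of candidate stackers
best_lam, cv_loss = cv_select(P, y, lams)
weights = fit_weights(P, y, best_lam)  # refit the selected stacker on all data
```

Here each value of `lam` plays the role of one member of the family: a small penalty lets the weights adapt freely to the data, while a large one forces them toward a plain average. The sketch only shows the selection procedure; it says nothing about how close the selected stacker is to the oracle best, which is the subject of the theoretical result.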