Ensembling is among the most popular tools in machine learning (ML) due to its effectiveness in reducing variance and thus improving generalization. Most ensembling methods for black-box base learners fall under the umbrella of "stacked generalization," namely training an ML algorithm that takes the base learners' predictions as input. While stacking has been widely applied in practice, its theoretical properties are poorly understood. In this paper, we prove that choosing the best stacked generalization from a (finite or finite-dimensional) family of stacked generalizations based on cross-validated performance does not perform "much worse" than the oracle best. Our result strengthens and significantly extends the results of Van der Laan et al. (2007). Inspired by the theoretical analysis, we further propose a particular family of stacked generalizations in the context of probabilistic forecasting, each with a different sensitivity to how much the ensemble weights are allowed to vary across items, timestamps in the forecast horizon, and quantiles. Experimental results demonstrate the performance gain of the proposed method.
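To make the selection step concrete, the following is a minimal sketch of choosing one stacker from a finite family by cross-validated loss. It is not the method proposed in the paper. The family used here (linear stacking weights shrunk toward the uniform average by a penalty `lam`), the squared-error loss, the synthetic data, and the names `fit_ridge_weights` and `cv_select_stacker` are all assumptions made for illustration. The paper itself works with probabilistic forecasts and a family indexed by how much the weights may vary across items, timestamps, and quantiles.

```python
import numpy as np


def fit_ridge_weights(P, y, lam):
    """Stacking weights that minimize ||P w - y||^2 + lam * ||w - u||^2,
    where u is the uniform weight vector (hypothetical family, for illustration)."""
    m = P.shape[1]
    u = np.full(m, 1.0 / m)
    # Normal equations of the penalized least-squares problem.
    A = P.T @ P + lam * np.eye(m)
    b = P.T @ y + lam * u
    return np.linalg.solve(A, b)


def cv_select_stacker(P, y, lams, n_folds=5, seed=0):
    """Return (best_lam, cv_losses), picking the lam with the lowest K-fold CV MSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(P.shape[0]), n_folds)
    cv_losses = []
    for lam in lams:
        fold_losses = []
        for k in range(n_folds):
            val = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            # Fit the stacker on the training folds, score it on the held-out fold.
            w = fit_ridge_weights(P[train], y[train], lam)
            fold_losses.append(np.mean((P[val] @ w - y[val]) ** 2))
        cv_losses.append(float(np.mean(fold_losses)))
    best = int(np.argmin(cv_losses))
    return lams[best], cv_losses


# Synthetic example: three hypothetical base learners whose predictions are
# the target plus noise of increasing size.
rng = np.random.default_rng(1)
n = 200
y = rng.normal(size=n)
P = np.column_stack([y + rng.normal(scale=s, size=n) for s in (0.3, 0.6, 1.0)])

lams = [0.01, 1.0, 100.0, 1e6]  # small lam: free weights; large lam: near-uniform average
best_lam, cv_losses = cv_select_stacker(P, y, lams)
print("selected lam:", best_lam)
```

In this sketch, each value of `lam` plays the role of one member of the family, and a very large `lam` recovers (approximately) the plain average of the base learners. The theoretical result summarized above concerns how far the cross-validated choice can fall behind the best member of such a family. The sketch only illustrates the selection procedure, not the bound.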