Real-life machine learning problems exhibit distributional shifts in the data from one time to another or from one place to another. This behavior is beyond the scope of the traditional empirical risk minimization paradigm, which assumes that data are i.i.d. over time and across locations. The emerging field of out-of-distribution (OOD) generalization addresses this reality with new theory and algorithms that incorporate environmental, or era-wise, information. So far, most research has focused on linear models and/or neural networks. In this research we develop two new splitting criteria for decision trees, which allow us to apply ideas from OOD generalization research to decision tree models, including random forests and gradient-boosted decision trees. The new splitting criteria use the era-wise information associated with each data point to let tree-based models find split points that are optimal across all disjoint eras in the data, rather than optimal over the entire data set pooled together, which is the default setting. We describe the new splitting criteria in detail and develop unique experiments to showcase their benefits; the new criteria improve out-of-sample metrics in our experiments. The new criteria are incorporated into a state-of-the-art gradient-boosted decision tree model in the Scikit-Learn code base, which is made freely available.
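To make the contrast between pooled and era-wise split selection concrete, the following is a minimal sketch for a single numeric feature in a regression setting. It is not the implementation described in this work and not part of the Scikit-Learn API: the abstract does not specify the two criteria, so the sketch assumes one plausible reading, namely scoring each candidate threshold by the mean of the per-era variance reductions. The function names `variance_reduction`, `pooled_split`, and `era_wise_split` are hypothetical.

```python
import numpy as np


def variance_reduction(x, y, threshold):
    """Per-sample decrease in squared error from splitting at x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # the split does not separate anything in this subset

    def sse(v):
        return float(np.sum((v - v.mean()) ** 2))

    return (sse(y) - sse(left) - sse(right)) / len(y)


def pooled_split(x, y):
    """Default behaviour: best threshold over all data pooled together."""
    thresholds = np.unique(x)[:-1]
    if len(thresholds) == 0:
        raise ValueError("feature is constant; no split is possible")
    gains = [variance_reduction(x, y, t) for t in thresholds]
    return thresholds[int(np.argmax(gains))]


def era_wise_split(x, y, eras):
    """Assumed era-wise criterion: best mean of the per-era gains."""
    thresholds = np.unique(x)[:-1]
    if len(thresholds) == 0:
        raise ValueError("feature is constant; no split is possible")
    era_ids = np.unique(eras)
    mean_gains = [
        np.mean([variance_reduction(x[eras == e], y[eras == e], t)
                 for e in era_ids])
        for t in thresholds
    ]
    return thresholds[int(np.argmax(mean_gains))]


# Toy data: the feature and the target both shift between two eras.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 10.0, 11.0])
eras = np.array([0, 0, 1, 1])

print("pooled threshold:  ", pooled_split(x, y))
print("era-wise threshold:", era_wise_split(x, y, eras))
```

In this toy data the pooled criterion selects the threshold that separates the two eras, because the shift in the target between eras dominates the pooled variance. Under the assumed era-wise score that threshold earns no gain, since it separates nothing within either era, so a within-era threshold is chosen instead.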