Bayesian Decision Trees via Tractable Priors and Probabilistic Context-Free Grammars

Decision Trees are among the most popular machine learning models today due to their out-of-the-box performance and interpretability. Decision Tree models are often constructed greedily in a top-down fashion via heuristic search criteria, such as Gini impurity or entropy. However, trees constructed in this manner are sensitive to minor fluctuations in the training data and are prone to overfitting. In contrast, Bayesian approaches to tree construction formulate the selection process as a posterior inference problem; such approaches are more stable and provide stronger theoretical guarantees. However, generating Bayesian Decision Trees usually requires sampling from complex, multimodal posterior distributions. Current Markov Chain Monte Carlo-based approaches for sampling Bayesian Decision Trees are prone to mode collapse and long mixing times, which makes them impractical. In this paper, we propose a new criterion for training Bayesian Decision Trees. Our criterion gives rise to BCART-PCFG, which can efficiently sample decision trees from a posterior distribution over trees given the data and find the maximum a posteriori (MAP) tree. Learning the posterior and training the sampler can be done in time that is polynomial in the dataset size. Once the posterior has been learned, trees can be sampled efficiently (in time linear in the number of nodes). At the core of our method is a reduction of sampling the posterior to sampling a derivation from a probabilistic context-free grammar. We find that trees sampled via BCART-PCFG perform comparably to or better than greedily constructed Decision Trees in classification accuracy on several datasets. Additionally, the trees sampled via BCART-PCFG are significantly smaller, sometimes by as much as 20x.
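To illustrate what sampling a derivation from a probabilistic context-free grammar involves, the following is a minimal sketch of top-down ancestral sampling. It is not the BCART-PCFG grammar itself: the toy grammar `GRAMMAR`, its rule probabilities, and the helper names `sample_derivation` and `count_terminal` are hypothetical and chosen only for illustration. In the actual method, the rule probabilities would be derived from the learned posterior rather than fixed by hand.

```python
import random

# Hypothetical toy grammar (an assumption, not the paper's grammar).
# Each nonterminal maps to a list of (probability, right-hand side) rules.
# "T" stands for a subtree; "leaf" and "split" are terminal symbols.
GRAMMAR = {
    "T": [
        (0.6, ("leaf",)),            # stop and emit a leaf
        (0.4, ("split", "T", "T")),  # emit a split with two subtrees
    ],
}


def sample_derivation(symbol, grammar, rng):
    """Sample a derivation tree rooted at `symbol` by ancestral sampling.

    Each nonterminal is expanded exactly once, so the work is linear in
    the number of nodes of the sampled derivation.
    """
    if symbol not in grammar:
        return symbol  # terminal symbol: nothing to expand
    rules = grammar[symbol]
    weights = [prob for prob, _ in rules]
    _, rhs = rng.choices(rules, weights=weights, k=1)[0]
    return (symbol, [sample_derivation(s, grammar, rng) for s in rhs])


def count_terminal(tree, name):
    """Count occurrences of the terminal `name` in a derivation tree."""
    if isinstance(tree, str):
        return 1 if tree == name else 0
    return sum(count_terminal(child, name) for child in tree[1])


if __name__ == "__main__":
    rng = random.Random(0)
    tree = sample_derivation("T", GRAMMAR, rng)
    print("leaves:", count_terminal(tree, "leaf"))
    print("splits:", count_terminal(tree, "split"))
```

Because the split rule has probability 0.4 and produces two subtrees, the expected number of children per node is 0.8, so the recursion in this toy grammar terminates with probability 1. Every sampled derivation is a binary tree, so the number of leaves always exceeds the number of splits by one.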
