Decision trees remain among the most popular machine learning models, largely because of their out-of-the-box performance and interpretability. In this work, we present a Bayesian approach to decision tree induction via maximum a posteriori (MAP) inference over a posterior distribution on trees. We first demonstrate a connection between MAP inference of decision trees and AND/OR search. Using this connection, we propose an AND/OR search algorithm, dubbed MAPTree, that recovers the MAP tree. We then evaluate the MAP tree empirically on both synthetic and real-world data. On 16 real-world datasets, MAPTree either outperforms baselines or achieves comparable performance with much smaller trees. On a synthetic dataset, MAPTree also demonstrates greater robustness to noise and better generalization than existing approaches. Finally, MAPTree recovers the MAP tree faster than existing sampling approaches and, in contrast with those algorithms, provides a certificate of optimality. The code for our experiments is available at https://github.com/ThrunGroup/maptree.
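To make the notion of a maximum a posteriori tree concrete, the following is a minimal sketch, not the MAPTree algorithm itself. It assumes binary features and labels, a BCART-style split prior in which a node at depth d splits with probability ALPHA * (1 + d) ** (-BETA), a uniform choice among all features, and a Beta-Bernoulli marginal likelihood at each leaf. The hyperparameter values and the names `map_subtree`, `leaf_log_lik`, and `p_split` are illustrative assumptions. The recursion takes a maximum over split choices and a sum over the two children, which loosely mirrors the OR and AND nodes of an AND/OR search, but it is exhaustive and exponential, so it is only suitable for toy data.

```python
import math

# Assumed hyperparameters (illustrative only, not taken from the paper).
ALPHA, BETA = 0.95, 0.5   # split prior: P(split at depth d) = ALPHA * (1 + d) ** (-BETA)
RHO = (1.0, 1.0)          # Beta prior on each leaf's probability of label 1


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def leaf_log_lik(labels):
    """Log marginal likelihood of binary labels under a Beta-Bernoulli leaf."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    return log_beta(RHO[0] + n1, RHO[1] + n0) - log_beta(RHO[0], RHO[1])


def p_split(depth):
    """Assumed prior probability that a node at this depth is internal."""
    return ALPHA * (1 + depth) ** (-BETA)


def map_subtree(X, y, idx, depth=0):
    """Return (unnormalized log posterior, tree) of the best subtree on samples idx.

    A tree is either the string "leaf" or a tuple (feature, left, right),
    where left holds samples with feature value 0 and right those with 1.
    """
    labels = [y[i] for i in idx]
    # Option 1: stop here and make this node a leaf.
    best_score = math.log(1.0 - p_split(depth)) + leaf_log_lik(labels)
    best_tree = "leaf"
    n_features = len(X[0])
    # Option 2: split on some feature (max over choices, sum over children).
    for f in range(n_features):
        left = [i for i in idx if X[i][f] == 0]
        right = [i for i in idx if X[i][f] == 1]
        if not left or not right:
            continue  # skip splits that do not separate the samples
        left_score, left_tree = map_subtree(X, y, left, depth + 1)
        right_score, right_tree = map_subtree(X, y, right, depth + 1)
        score = (math.log(p_split(depth)) - math.log(n_features)
                 + left_score + right_score)
        if score > best_score:
            best_score, best_tree = score, (f, left_tree, right_tree)
    return best_score, best_tree


if __name__ == "__main__":
    # Toy data: the label equals feature 0; feature 1 is irrelevant.
    X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 3
    y = [row[0] for row in X]
    print(map_subtree(X, y, list(range(len(X)))))
```

Because the score is a log prior plus a log marginal likelihood, a deeper tree is preferred only when the gain in likelihood outweighs the prior penalty for the additional splits, which is one way to read the abstract's observation about smaller trees.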