Most implementations of Bayesian additive regression trees (BART) one-hot encode categorical predictors, replacing each one with several binary indicators, one for every level or category. Regression trees built with these indicators partition the discrete set of categorical levels by repeatedly removing one level at a time. Unfortunately, the vast majority of partitions cannot be built with this strategy, severely limiting BART's ability to partially pool data across groups of levels. Motivated by analyses of baseball data and neighborhood-level crime dynamics, we overcame this limitation by re-implementing BART with regression trees that can assign multiple levels to both branches of a decision tree node. To model spatial data aggregated into small regions, we further proposed a new decision rule prior that creates spatially contiguous regions by deleting a random edge from a random spanning tree of a suitably defined network. Our re-implementation, which is available in the flexBART package, often yields improved out-of-sample predictive performance and scales better to larger datasets than existing implementations of BART.
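The two ideas above can be illustrated with a minimal Python sketch. It is not the flexBART implementation, and every function name in it is hypothetical. The first helper counts the two-way splits of K levels available at a single decision node: a one-hot indicator rule can only separate one level from the rest, giving at most K splits, whereas 2^(K-1) - 1 splits into two non-empty groups exist (for K = 10, that is 10 versus 511). The remaining helpers split a connected adjacency network into two contiguous pieces by deleting a random edge from a random spanning tree. The tree is drawn here by running Kruskal's algorithm on shuffled edges, which is an assumption made for simplicity and is not a uniform draw over spanning trees.

```python
import random
from collections import defaultdict


def n_two_way_splits(num_levels):
    """Number of ways to split `num_levels` levels into two non-empty groups."""
    return 2 ** (num_levels - 1) - 1


def random_spanning_tree(nodes, edges, rng):
    """Kruskal's algorithm on randomly shuffled edges.

    Assumption: a simple way to obtain a random spanning tree; it is not a
    uniform draw over all spanning trees. The graph must be connected.
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    shuffled = list(edges)
    rng.shuffle(shuffled)
    tree = []
    for u, v in shuffled:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so keep it
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


def reachable(start, edges, allowed):
    """Nodes reachable from `start` using only edges inside `allowed`."""
    adj = defaultdict(list)
    for u, v in edges:
        if u in allowed and v in allowed:
            adj[u].append(v)
            adj[v].append(u)
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def contiguous_split(nodes, edges, rng):
    """Delete a random edge of a random spanning tree; return both sides.

    Requires a connected graph with at least two nodes.
    """
    tree = random_spanning_tree(nodes, edges, rng)
    cut = rng.choice(tree)
    remaining = [e for e in tree if e != cut]
    left = reachable(cut[0], remaining, set(nodes))
    right = set(nodes) - left
    return left, right


def is_connected(subset, edges):
    """Check that `subset` is connected in the original network."""
    subset = set(subset)
    return reachable(next(iter(subset)), edges, subset) == subset
```

Because each side of the cut is a connected piece of the spanning tree, it is also connected in the original network, so `is_connected` holds for both returned sets. Calling `contiguous_split` with a seeded `random.Random` instance makes the draw reproducible.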