On the Computational Efficiency of Bayesian Additive Regression Trees: An Asymptotic Analysis

Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by well-developed estimation theory, including guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data-generating settings and for appropriate prior choices. However, the computational properties of the widely used BART sampler proposed by Chipman et al. (2010) are not yet well understood. In this paper, we perform an asymptotic analysis of a slightly modified version of the default BART sampler when it is fitted to data from data-generating processes with discrete covariates. We show that the sampler's time to convergence, measured as the hitting time of a high posterior density set, increases with the number of training samples because the target posterior is multi-modal. On the other hand, we show that this trend can be dampened by simple changes, such as increasing the number of trees in the ensemble or raising the temperature of the sampler. These results provide a nuanced picture of the computational efficiency of the BART sampler in the presence of large amounts of training data, and they suggest strategies for improving the sampler. We complement our theoretical analysis with a simulation study focusing on the default BART sampler. We observe that the increase in convergence time with the number of training samples holds for the default BART sampler and is robust to changes in sampler initialization, number of burn-in iterations, feature selection prior, and discretization strategy. In contrast, increasing the number of trees or raising the temperature sharply dampens this trend, as our theory indicates.
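The qualitative mechanism described above (hitting times that grow with the sample size under a multi-modal posterior, and that shrink when the temperature is raised) can be illustrated with a toy calculation. The sketch below is not BART and not the sampler analyzed in the paper. It assumes a hypothetical one-dimensional target on the states 0..K, proportional to exp(-n * energy(x) / temperature), explored by a nearest-neighbour Metropolis chain. The names `energy`, `expected_hitting_time`, `K`, and the shape of the double well are all illustrative assumptions. The expected number of steps to travel from the local mode (state 0) to the global mode (state K) is computed exactly by first-step analysis rather than by simulation.

```python
import math

K = 10  # states 0..K; state 0 is a local mode, state K the global mode (assumed toy setup)


def energy(x):
    """Hypothetical per-sample negative log-likelihood: a tilted double well."""
    s = x / K
    return 1.0 - (2.0 * s - 1.0) ** 2 + 0.3 * (1.0 - s)


def expected_hitting_time(n, temperature=1.0):
    """Expected Metropolis steps to first reach state K, starting from state 0.

    The target is proportional to exp(-n * energy(x) / temperature). Proposals
    move one state left or right with probability 1/2 each.
    """
    beta = n / temperature

    def move_prob(x, y):
        # Proposal probability 1/2 times the Metropolis acceptance probability.
        delta = energy(y) - energy(x)
        return 0.5 if delta <= 0.0 else 0.5 * math.exp(-beta * delta)

    total = 0.0
    t_prev = 0.0  # expected time to move from x-1 to x
    for x in range(K):
        up = move_prob(x, x + 1)
        down = move_prob(x, x - 1) if x > 0 else 0.0  # moves below 0 are rejected
        # First-step analysis for a birth-death chain:
        # t_x = (1 + down * t_{x-1}) / up
        t_prev = (1.0 + down * t_prev) / up
        total += t_prev
    return total


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, expected_hitting_time(n), expected_hitting_time(n, temperature=4.0))
```

In this toy model the energy barrier between the two modes scales with n / temperature, so the expected hitting time grows rapidly with n at a fixed temperature and falls when the temperature is raised. Whether and how this carries over to the BART posterior over tree ensembles is the subject of the paper's analysis, not of this sketch.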