BCART (Bayesian Classification and Regression Trees) and BART (Bayesian Additive Regression Trees) are popular Bayesian regression models that are widely applicable to modern regression problems. Their popularity stems from their ability to flexibly model complex responses that depend on high-dimensional inputs while also quantifying uncertainty. This ability is key, as it allows researchers to perform appropriate inferential analyses in settings that have generally been too difficult to handle with a Bayesian approach. However, surprisingly little work has been done to evaluate the sensitivity of these models to violations of modeling assumptions. In particular, we consider influential observations, which one would reasonably expect to be common, or at least a concern, in the big-data setting. In this paper, we address both the detection of influential observations and the adjustment of predictions so that they are not unduly affected by such potentially problematic data. We consider three detection diagnostics for Bayesian tree models: an analogue of Cook's distance, a divergence measure, and a conditional predictive density metric. We then propose an importance sampling algorithm that re-weights previously sampled posterior draws to remove the effects of influential data in a computationally efficient manner. Finally, we demonstrate our methods on real-world data where blind application of the models can lead to poor predictions and inference.
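To make the re-weighting idea concrete, the following is a minimal sketch of generic case-deletion importance sampling; it is not necessarily the algorithm proposed in the paper. It assumes that observations are conditionally independent given the model parameters and that a matrix of pointwise log-likelihoods `log_lik` (one row per posterior draw, one column per observation) can be extracted from a fitted BART or BCART sampler. All function and variable names here are hypothetical. Under those assumptions, draws from the full posterior can be re-weighted by the inverse likelihood of the deleted observations, and the same quantities yield a conditional predictive ordinate (CPO) for each observation.

```python
import numpy as np


def _logsumexp(a, axis=None):
    """Numerically stable log(sum(exp(a))) along the given axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    out = a_max + np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)


def case_deletion_weights(log_lik, drop):
    """Normalized importance weights targeting the posterior without `drop`.

    log_lik : (S, n) array, log p(y_i | theta_s) for draw s and observation i.
    drop    : index or list of indices of observations to delete.
    Assumes conditional independence, so w_s is proportional to
    1 / prod_{i in drop} p(y_i | theta_s).
    """
    log_lik = np.asarray(log_lik, dtype=float)
    drop = np.atleast_1d(drop)
    log_w = -log_lik[:, drop].sum(axis=1)
    log_w = log_w - _logsumexp(log_w)  # normalize on the log scale
    return np.exp(log_w)


def log_cpo(log_lik):
    """Log conditional predictive ordinate per observation.

    Uses the harmonic-mean identity CPO_i = 1 / mean_s(1 / p(y_i | theta_s)).
    """
    log_lik = np.asarray(log_lik, dtype=float)
    n_draws = log_lik.shape[0]
    return -(_logsumexp(-log_lik, axis=0) - np.log(n_draws))


def reweighted_mean(draw_preds, weights):
    """Weighted posterior mean prediction; draw_preds has shape (S, m)."""
    return np.asarray(weights) @ np.asarray(draw_preds, dtype=float)


def effective_sample_size(weights):
    """Kish effective sample size; small values signal unreliable weights."""
    weights = np.asarray(weights, dtype=float)
    return 1.0 / np.sum(weights ** 2)
```

A low `log_cpo` value flags an observation that the rest of the data predicts poorly. Because the weights can become highly variable when an observation is very influential, checking `effective_sample_size` before trusting `reweighted_mean` is a reasonable safeguard; when it is small, refitting without the flagged observations may be needed.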