Influential Observations in Bayesian Regression Tree Models

BCART (Bayesian Classification and Regression Trees) and BART (Bayesian Additive Regression Trees) are popular Bayesian regression models that are widely applicable to modern regression problems. Their popularity is closely tied to their ability to flexibly model complex responses that depend on high-dimensional inputs while also quantifying uncertainty. This ability is key, as it allows researchers to perform appropriate inferential analyses in settings that have generally been too difficult to handle with the Bayesian approach. However, surprisingly little work has been done to evaluate the sensitivity of these modern regression models to violations of modeling assumptions. In particular, we consider influential observations, which one would reasonably expect to be common, or at least a concern, in the big-data setting. In this paper, we consider both the problem of detecting influential observations and the problem of adjusting predictions so that they are not unduly affected by such potentially problematic data. We consider three detection diagnostics for Bayesian tree models: one is an analogue of Cook's distance, and the other two take the form of a divergence measure and a conditional predictive density metric. We then propose an importance sampling algorithm that re-weights previously sampled posterior draws so as to remove the effects of influential data in a computationally efficient manner. Finally, we demonstrate our methods on real-world data where blind application of the models can lead to poor predictions and inference.
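The re-weighting idea can be illustrated with a generic case-deletion importance sampling sketch. This is not the paper's algorithm, whose details are not given in the abstract; it is the standard identity that, for conditionally independent observations, the posterior without observation i is proportional to the full posterior divided by p(y_i | theta). The Gaussian likelihood, the synthetic draws, and all function names below are assumptions made for illustration. Here `mu_draws` stands in for the fitted value at x_i under each posterior draw of a tree model.

```python
import numpy as np


def gaussian_log_lik(y_i, mu_draws, sigma_draws):
    """log p(y_i | theta_s) for each posterior draw s (assumed Gaussian likelihood)."""
    mu = np.asarray(mu_draws, dtype=float)
    sigma = np.asarray(sigma_draws, dtype=float)
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((y_i - mu) / sigma) ** 2


def case_deletion_weights(y_i, mu_draws, sigma_draws):
    """Normalized weights proportional to 1 / p(y_i | theta_s).

    Re-weighting draws from the full posterior with these weights targets
    the posterior with observation i removed, without refitting the model.
    """
    log_w = -gaussian_log_lik(y_i, mu_draws, sigma_draws)
    log_w = log_w - log_w.max()  # subtract the max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()


def log_cpo(y_i, mu_draws, sigma_draws):
    """Log conditional predictive ordinate: harmonic mean of the likelihoods."""
    neg = -gaussian_log_lik(y_i, mu_draws, sigma_draws)
    m = neg.max()
    return -(m + np.log(np.mean(np.exp(neg - m))))


# Synthetic stand-in for posterior draws of the fitted value at x_i.
rng = np.random.default_rng(0)
mu_draws = rng.normal(0.0, 0.5, size=2000)
sigma_draws = np.ones(2000)

w = case_deletion_weights(5.0, mu_draws, sigma_draws)  # y_i = 5.0 is far from the fit
reweighted_mean = np.sum(w * mu_draws)                 # estimate with y_i removed
ess = 1.0 / np.sum(w**2)                               # effective sample size
print(reweighted_mean, ess, log_cpo(5.0, mu_draws, sigma_draws))
```

A low log CPO flags an observation the model predicts poorly once that observation is left out, and a small effective sample size signals that the re-weighted estimate rests on few draws and may be unreliable.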

