Distribution regression, where the goal is to predict a scalar response from a distribution-valued predictor, arises naturally in settings where observations are grouped and outcomes depend on group-level characteristics rather than on individual measurements. We introduce DistBART, a Bayesian nonparametric approach to distribution regression that models the regression function as a linear functional with the Riesz representer assigned a Bayesian additive regression trees (BART) prior. We argue that shallow decision tree ensembles encode reasonable inductive biases for tabular data, making them appropriate in settings where the functional depends primarily on low-dimensional marginals of the distributions. We show this both empirically, on synthetic and real data, and theoretically, through an adaptive posterior concentration result. We also establish a connection to kernel methods, and use this connection to motivate variants of DistBART that can learn nonlinear functionals. To enable scalability to large datasets, we develop a random-feature approximation that samples trees from the BART prior and reduces inference to sparse Bayesian linear regression, achieving computational efficiency while retaining uncertainty quantification.