Probabilistic prediction aims to compute predictive distributions rather than single-point predictions. These distributions enable practitioners to quantify uncertainty, compute risk, and detect outliers. However, most probabilistic methods assume a parametric form for the response, such as a Gaussian or Poisson distribution. When these assumptions fail, such models yield poor predictions and poorly calibrated uncertainty. In this paper, we propose Treeffuser, an easy-to-use method for probabilistic prediction on tabular data. The idea is to learn a conditional diffusion model in which the score function is estimated using gradient-boosted trees. The conditional diffusion model makes Treeffuser flexible and non-parametric, while the gradient-boosted trees make it robust and easy to train on CPUs. Treeffuser learns well-calibrated predictive distributions and can handle a wide range of regression tasks, including those with multivariate, multimodal, and skewed responses. We study Treeffuser on synthetic and real data and show that it outperforms existing methods, providing better-calibrated probabilistic predictions. We further demonstrate its versatility with an application to inventory allocation under uncertainty using sales data from Walmart. Our implementation of Treeffuser is available at \href{https://github.com/blei-lab/treeffuser}{https://github.com/blei-lab/treeffuser}.
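To make the central idea concrete, the following is a rough sketch of a generic conditional denoising score-matching formulation. It is an illustration under stated assumptions, not necessarily the exact objective used by Treeffuser. The symbols $m(t)$, $\sigma(t)$, $\zeta$, and $s$ are notation assumed here. Suppose a response $y_0$ with predictors $x$ is perturbed at diffusion time $t$ by a Gaussian kernel, $y_t = m(t)\,y_0 + \sigma(t)\,\zeta$ with $\zeta \sim \mathcal{N}(0, I)$. The score of this kernel has a closed form:
\[
\nabla_{y_t} \log p(y_t \mid y_0) = -\frac{y_t - m(t)\,y_0}{\sigma(t)^2} = -\frac{\zeta}{\sigma(t)}.
\]
A score estimator $s(y_t, x, t)$ can therefore be fit by ordinary least squares against this target:
\[
\min_{s} \; \mathbb{E}_{t}\, \mathbb{E}_{(x,\, y_0)}\, \mathbb{E}_{\zeta} \left[ \left\| s\big(m(t)\,y_0 + \sigma(t)\,\zeta,\; x,\; t\big) + \frac{\zeta}{\sigma(t)} \right\|^2 \right].
\]
Under this formulation, estimating the score reduces to a squared-error regression with inputs $(y_t, x, t)$ and target $-\zeta/\sigma(t)$, which is the kind of problem gradient-boosted trees are designed to solve on tabular data.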