Prior-Data Fitted Networks (PFNs), such as TabPFN and TabICL, have revolutionized tabular deep learning by leveraging in-context learning for tabular data. These models are intended as foundation models for classification and regression, and they promise to greatly simplify practical deployment because their performance is unprecedented (in terms of mean squared error or $R^2$ on common benchmarks such as TabArena or TALENT). However, we see an important weakness in current benchmarks for the regression setting: they focus on evaluating win rates and performance using metrics such as (root) mean squared error or $R^2$. These leaderboards therefore push researchers, implicitly and explicitly, to optimize machine learning pipelines that elicit a good estimate of the mean. The main problem is that this approach evaluates only a point estimate, namely the mean, which is the Bayes estimator associated with the squared error loss. In this article we discuss the application of proper scoring rules for evaluating the goodness of probabilistic forecasts in distributional regression. We also propose to enhance common machine learning benchmarks with metrics for probabilistic regression. To improve the status quo and make the machine learning community aware of scoring rules for probabilistic regression, we advocate using the continuous ranked probability score (CRPS) in benchmarks for probabilistic regression. However, we also illustrate that the choice of scoring rule changes the inductive bias of the trained model. We therefore advocate for fine-tunable or promptable tabular foundation models.
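To make the CRPS concrete: for a predictive cumulative distribution function $F$ and an observation $y$, the score is $\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \big(F(x) - \mathbf{1}\{x \geq y\}\big)^2 \, dx$, and lower values are better. For a Gaussian predictive distribution there is a well-known closed form (Gneiting and Raftery), which the minimal sketch below evaluates; the function name and the synthetic data are illustrative and not part of any existing benchmark or library API.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS for a Gaussian predictive distribution N(mu, sigma^2).

    CRPS(N(mu, sigma), y) = sigma * (z * (2 * Phi(z) - 1) + 2 * phi(z) - 1 / sqrt(pi)),
    with z = (y - mu) / sigma. Lower is better; the score is in the units of y.
    """
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

# Illustrative comparison: a sharp but slightly biased forecast vs. an unbiased but very wide one.
y_true = np.array([1.0, 2.0, 3.0])
print(crps_gaussian(y_true, mu=y_true + 0.2, sigma=0.3).mean())  # sharp, small bias
print(crps_gaussian(y_true, mu=y_true, sigma=2.0).mean())        # unbiased, overly dispersed
```

Unlike mean squared error, which only rewards an accurate conditional-mean estimate, the CRPS rewards both calibration and sharpness of the full predictive distribution, which is why the two toy forecasts above are ranked differently than they would be under a pure point-estimate metric.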