ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules

Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet prevailing regression benchmarks evaluate them almost exclusively with point-estimate metrics (RMSE, $R^2$). This discards precisely the distributional information these models are designed to provide, a critical gap in high-stakes domains where not all errors are equally costly. We introduce ScoringBench, an open and extensible benchmark that evaluates tabular regression models under a comprehensive suite of proper scoring rules (including CRPS, CRLS, interval score, energy score, and weighted CRPS) alongside standard point metrics. ScoringBench covers 97 regression datasets from diverse domains, supports transparent community contributions via a git-based leaderboard, and provides two complementary ranking protocols: an ordinal Demšar/autorank approach and a magnitude-preserving z-score approach. Evaluating several models, spanning in-context learners, fine-tuned foundation models, gradient-boosted trees, and MLPs, we find that model rankings shift substantially with the scoring rule: models that excel on point-estimate metrics can rank poorly on probabilistic ones, and the top-performing model under one proper scoring rule may rank noticeably lower under another. These results demonstrate that the choice of evaluation metric is not a technicality but a modelling decision and, for applications where, for example, tail errors are disproportionately costly, a domain-specific requirement with direct consequences for model deployment.
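To make the contrast between point metrics and proper scoring rules concrete, the following is a minimal sketch, not ScoringBench's actual implementation. It assumes predictive distributions are available as samples, uses the standard sample-based CRPS estimator $\mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ and the standard interval score for a central $(1-\alpha)$ interval, and the function names `crps_from_samples` and `interval_score` are hypothetical. Two synthetic forecasts share the same mean, so any point metric computed from the mean cannot tell them apart, while CRPS can.

```python
import numpy as np


def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    samples = np.asarray(samples, dtype=float)
    # Mean absolute distance between forecast samples and the observation.
    term_obs = np.mean(np.abs(samples - y))
    # Mean absolute distance between all pairs of forecast samples (spread).
    term_spread = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term_obs - term_spread


def interval_score(lower, upper, y, alpha):
    """Interval score for a central (1 - alpha) prediction interval (lower is better)."""
    width = upper - lower
    # Penalties scale with how far the observation falls outside the interval.
    penalty_below = (2.0 / alpha) * max(lower - y, 0.0)
    penalty_above = (2.0 / alpha) * max(y - upper, 0.0)
    return width + penalty_below + penalty_above


# Illustrative data only: two forecasts with identical means but different spread.
rng = np.random.default_rng(0)
z = rng.normal(size=1000)
z -= z.mean()            # centre the draws so both forecasts have mean 0
sharp = 1.0 * z          # narrow predictive distribution
wide = 5.0 * z           # wide predictive distribution, same mean
y_true = 0.0

crps_sharp = crps_from_samples(sharp, y_true)
crps_wide = crps_from_samples(wide, y_true)

print("point error (sharp, wide):", abs(sharp.mean() - y_true), abs(wide.mean() - y_true))
print("CRPS        (sharp, wide):", crps_sharp, crps_wide)
```

Both forecasts have the same point error here, yet the wide forecast receives a worse (larger) CRPS because it is less sharp. This is the kind of distinction that RMSE and $R^2$ cannot register.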
