Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go-to fix, yet it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is $0.961$, close enough to $1$ that the gain available from any convex combination of their predictions is tightly bounded above. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys $+0.18\%$ accuracy over the strongest single TFM at $253\times$ the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; the three other ensembles are significantly \emph{worse} than the best base TFM. Stacking with a logistic-regression meta-learner is the most striking case: it achieves competitive accuracy and ROC-AUC but the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which destroys calibration. We recommend greedy selection as the practical default.
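
The redundancy figure above can be made concrete with a minimal sketch. It assumes the Q-statistic is Yule's $Q$ computed on per-example correctness indicators, as is common in the ensemble-diversity literature; the abstract does not state the exact variant or how values are aggregated across datasets. The function names `q_statistic` and `mean_pairwise_q` are hypothetical.

```python
from itertools import combinations

import numpy as np


def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, given boolean per-example correctness.

    Assumption: Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10), where N11 counts
    examples both models get right, N00 both wrong, N10/N01 exactly one right.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n11 = int(np.sum(a & b))
    n00 = int(np.sum(~a & ~b))
    n10 = int(np.sum(a & ~b))
    n01 = int(np.sum(~a & b))
    den = n11 * n00 + n01 * n10
    if den == 0:
        return float("nan")  # undefined when both products vanish
    return (n11 * n00 - n01 * n10) / den


def mean_pairwise_q(correct):
    """Average Q over all unordered model pairs.

    `correct` has shape (n_models, n_examples); undefined pairs are skipped.
    """
    correct = np.asarray(correct, dtype=bool)
    pairs = combinations(range(correct.shape[0]), 2)
    return float(np.nanmean([q_statistic(correct[i], correct[j]) for i, j in pairs]))
```

With six models there are 15 unordered pairs. A mean near $1$ indicates that the models tend to err on the same examples, which leaves little for a combiner to exploit.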