With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static: their design is not updated even when flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of the validation method and of ensembling hyperparameter configurations when benchmarking models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state of the art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation-set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.