With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static: their design is not updated even when flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of the validation method and of ensembling hyperparameter configurations when benchmarking models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state of the art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation-set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.