The first tabular foundation model, TabPFN, and its successor TabPFNv2 have substantially impacted tabular AI, with dozens of methods building on them and hundreds of applications across different use cases. This report introduces TabPFN-2.5, the next generation of our tabular foundation model, built for datasets with up to 50,000 data points and 2,000 features, a 20x increase in data cells compared to TabPFNv2. TabPFN-2.5 is now the leading method on the industry-standard benchmark TabArena (which contains datasets with up to 100,000 training data points), substantially outperforming tuned tree-based models and matching the accuracy of AutoGluon 1.4, a complex four-hour-tuned ensemble that even includes the previous TabPFNv2. Remarkably, default TabPFN-2.5 has a 100% win rate against default XGBoost on small- to medium-sized classification datasets (up to 10,000 data points and 500 features) and an 87% win rate on larger datasets with up to 100,000 samples and 2,000 features (85% for regression). For production use cases, we introduce a new distillation engine that converts TabPFN-2.5 into a compact MLP or tree ensemble, preserving most of its accuracy while delivering orders-of-magnitude lower latency and plug-and-play deployment. This new release will immediately strengthen the performance of the many applications and methods already built on the TabPFN ecosystem.
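The distillation idea above can be illustrated with a minimal, hedged sketch: a stand-in teacher model (gradient-boosted trees here, since TabPFN-2.5 itself is not assumed available) is compressed into a small MLP student by fitting the student to the teacher's soft predictions. All names and hyperparameters below are illustrative assumptions, not the actual TabPFN distillation engine.

```python
# Sketch of teacher-student distillation (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPRegressor

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train, X_pool = X[:500], y[:500], X[500:]

# 1. Fit the (expensive) teacher on the labeled data.
#    Assumption: in the real setting this would be TabPFN-2.5.
teacher = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# 2. Label a pool of inputs with the teacher's class probabilities
#    (soft labels carry more signal than hard 0/1 labels).
soft_labels = teacher.predict_proba(X_pool)[:, 1]

# 3. Train a compact MLP student to regress the teacher's probabilities.
student = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                       random_state=0).fit(X_pool, soft_labels)

# The student now approximates the teacher's decisions at a fraction of
# the inference cost; agreement measures how faithful the distillation is.
agreement = np.mean((student.predict(X_pool) > 0.5) == (soft_labels > 0.5))
```

A tree-ensemble student (e.g., a random forest fit to the same soft labels) would follow the identical recipe, trading some accuracy for even simpler deployment.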