We present TabPFN, a trained Transformer that performs supervised classification on small tabular datasets in less than a second, needs no hyperparameter tuning, and is competitive with state-of-the-art classification methods. TabPFN performs in-context learning (ICL): it learns to make predictions using sequences of labeled examples (x, f(x)) given in the input, without requiring further parameter updates. TabPFN is fully contained in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once to approximate Bayesian inference on synthetic datasets drawn from our prior. This prior incorporates ideas from causal reasoning: it entails a large space of structural causal models with a preference for simple structures. On the 18 datasets in the OpenML-CC18 suite that contain up to 1 000 training data points, up to 100 purely numerical features without missing values, and up to 10 classes, we show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems while being up to 230$\times$ faster. The speedup increases to 5 700$\times$ when using a GPU. We also validate these results on an additional 67 small numerical datasets from OpenML. We provide all our code, the trained TabPFN, an interactive browser demo, and a Colab notebook at https://github.com/automl/TabPFN.
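The evaluation regime stated above (up to 1 000 training points, up to 100 purely numerical features without missing values, up to 10 classes) can be expressed as a simple check. The following is a minimal sketch for illustration only: the function name `in_evaluated_regime` and its parameters are hypothetical and are not part of the TabPFN code base; only the three numeric limits come from the text.

```python
# Limits taken from the abstract's description of the evaluated datasets.
MAX_TRAIN_POINTS = 1000
MAX_FEATURES = 100
MAX_CLASSES = 10


def in_evaluated_regime(n_train, n_features, n_classes,
                        all_numerical=True, has_missing=False):
    """Return True if a dataset matches the regime evaluated in the abstract.

    Hypothetical helper (not a TabPFN API): it only compares dataset
    summary statistics against the limits quoted in the text.
    """
    return (
        n_train <= MAX_TRAIN_POINTS
        and n_features <= MAX_FEATURES
        and n_classes <= MAX_CLASSES
        and all_numerical
        and not has_missing
    )


# Example: a small, purely numerical binary-classification dataset.
small_ok = in_evaluated_regime(n_train=800, n_features=20, n_classes=2)

# Example: too many training points for the evaluated regime.
too_large = in_evaluated_regime(n_train=5000, n_features=20, n_classes=2)
```

A check like this says nothing about how the model behaves outside these limits; it only indicates whether a dataset resembles those on which the reported results were obtained.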