Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. A remarkable exception is the recently proposed prior-data fitted network, TabPFN: although it is effectively limited to training sets of size 3000, we find that it outperforms all other algorithms on average, even when randomly sampling 3000 training datapoints. Next, we analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better suited to it. For example, we find that GBDTs are much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. Our insights act as a guide for practitioners to determine which techniques may work best on their dataset. Finally, with the goal of accelerating tabular data research, we release the TabZilla Benchmark Suite: a collection of the 36 'hardest' of the datasets we study. Our benchmark suite, codebase, and all raw results are available at https://github.com/naszilla/tabzilla.
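As a minimal illustration of the kind of metafeature the abstract alludes to, the sketch below computes per-feature skewness, a quick proxy for the skewed or heavy-tailed distributions on which GBDTs tended to outperform NNs. This is an assumption-laden example, not code from the paper; the helper `feature_skewness` is hypothetical and only demonstrates the general idea.

```python
import numpy as np

def feature_skewness(X):
    """Per-column sample skewness: E[(x - mu)^3] / sigma^3.
    Large absolute values flag skewed/heavy-tailed features,
    one of the dataset irregularities discussed in the paper."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return ((X - mu) ** 3).mean(axis=0) / sigma ** 3

# Synthetic check: column 0 is heavy-tailed (log-normal), column 1 is Gaussian.
rng = np.random.default_rng(0)
X = np.column_stack([rng.lognormal(size=1000), rng.normal(size=1000)])
s = feature_skewness(X)  # s[0] is large, s[1] is near zero
```

In practice, a high maximum absolute skewness across features would be one signal (among the dozens of metafeatures the paper analyzes) that a GBDT is a safer default than an NN for that dataset.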