Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. A remarkable exception is the recently proposed prior-data fitted network, TabPFN: although it is effectively limited to training sets of size 3000, we find that it outperforms all other algorithms on average, even when randomly sampling 3000 training datapoints. Next, we analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. Our insights act as a guide for practitioners to determine which techniques may work best on their dataset. Finally, with the goal of accelerating tabular data research, we release the TabZilla Benchmark Suite: a collection of the 36 'hardest' of the datasets we study. Our benchmark suite, codebase, and all raw results are available at https://github.com/naszilla/tabzilla.