Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and ask, 'does it matter?' We conduct the largest tabular data analysis to date, by comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than selecting the best algorithm. Next, we analyze 965 metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed feature distributions, heavy-tailed feature distributions, and other forms of dataset irregularities. Our insights act as a guide for practitioners to decide whether or not they need to run a neural net to reach top performance on their dataset. Our codebase and all raw results are available at https://github.com/naszilla/tabzilla.