Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. Next, we analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. Our insights act as a guide for practitioners to determine which techniques may work best on their dataset. Finally, with the goal of accelerating tabular data research, we release the TabZilla Benchmark Suite: a collection of the 36 'hardest' of the datasets we study. Our benchmark suite, codebase, and all raw results are available at https://github.com/naszilla/tabzilla.