The analysis of tabular datasets is prevalent in both scientific research and real-world applications of Machine Learning (ML). Unlike in many other ML domains, Deep Learning (DL) models often fail to outperform traditional methods on tabular data. Previous comparative benchmarks have shown that DL performance is frequently comparable to, or even worse than, that of models such as Gradient Boosting Machines (GBMs). In this study, we introduce a comprehensive benchmark aimed at better characterizing the types of datasets on which DL models excel. Although several important benchmarks for tabular datasets already exist, our contribution lies in the variety and depth of the comparison: we evaluate 111 datasets with 20 different models, covering both regression and classification tasks. The datasets vary in scale and include some with and some without categorical variables. Importantly, our benchmark contains a sufficient number of datasets on which DL models perform best, allowing a thorough analysis of the conditions under which they excel. Building on the benchmark results, we train a model that predicts, with 86.1% accuracy (AUC 0.78), the scenarios in which DL models outperform alternative methods. We present insights derived from this characterization and compare these findings to previous benchmarks.