Many industry verticals must work with small tabular datasets. In this low-data regime, it is unclear whether the best performance comes from simple baselines or from more complex machine learning approaches that leverage meta-learning and ensembling. On 44 tabular classification datasets with sample sizes $\leq$ 500, we find that L2-regularized logistic regression performs similarly to state-of-the-art automated machine learning (AutoML) frameworks (AutoPrognosis, AutoGluon) and off-the-shelf deep neural networks (TabPFN, HyperFast) on the majority of the benchmark datasets. We therefore recommend considering logistic regression as the first choice for data-scarce applications with tabular data, and we provide practitioners with best practices for further method selection.