Classical machine learning models, such as linear models and tree-based models, are widely used in industry. Because these models are sensitive to the data distribution, feature preprocessing, which transforms features from one distribution to another, is a crucial step for ensuring good model quality. Manually constructing a feature preprocessing pipeline is challenging because data scientists must decide which preprocessors to select and in which order to compose them. In this paper, we study how to automate feature preprocessing (Auto-FP) for tabular data. Because the search space is large, a brute-force solution is prohibitively expensive. To address this challenge, we observe that Auto-FP can be modelled as either a hyperparameter optimization (HPO) problem or a neural architecture search (NAS) problem. This observation enables us to extend a variety of HPO and NAS algorithms to solve the Auto-FP problem. We conduct a comprehensive evaluation and analysis of 15 algorithms on 45 public ML datasets. Overall, evolution-based algorithms achieve the best average ranking. Surprisingly, random search turns out to be a strong baseline: many surrogate-model-based and bandit-based search algorithms, which perform well for HPO and NAS, do not outperform random search for Auto-FP. We analyze the reasons for these findings and conduct a bottleneck analysis to identify opportunities to improve these algorithms. Furthermore, we explore how to extend Auto-FP to support parameter search and compare two ways of achieving this goal. Finally, we evaluate Auto-FP in an AutoML context and discuss the limitations of popular AutoML tools. To the best of our knowledge, this is the first study on automated feature preprocessing. We hope our work can inspire researchers to develop new algorithms tailored for Auto-FP.
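To make the search problem concrete, the following is a minimal sketch of random search over preprocessing pipelines. It is not the implementation evaluated in this paper. The four preprocessors, the leave-one-out 1-nearest-neighbor scorer, the toy data, and all names and default values (`PREPROCESSORS`, `search_space_size`, `random_search`, `max_len`, `budget`) are assumptions chosen for illustration. The sketch treats a pipeline as an ordered sequence of preprocessors, with repetition allowed, applied to every feature column.

```python
import math
import random

# Hypothetical preprocessors for illustration: each maps one feature column
# (a list of floats) to a transformed column of the same length.
def min_max(col):
    lo, hi = min(col), max(col)
    span = (hi - lo) or 1.0  # avoid division by zero on constant columns
    return [(v - lo) / span for v in col]

def standardize(col):
    mean = sum(col) / len(col)
    var = sum((v - mean) ** 2 for v in col) / len(col)
    std = math.sqrt(var) or 1.0
    return [(v - mean) / std for v in col]

def signed_log(col):
    return [math.copysign(math.log1p(abs(v)), v) for v in col]

def binarize(col):
    mean = sum(col) / len(col)
    return [1.0 if v > mean else 0.0 for v in col]

PREPROCESSORS = {"min_max": min_max, "standardize": standardize,
                 "signed_log": signed_log, "binarize": binarize}

def search_space_size(num_preprocessors, max_len):
    """Number of ordered pipelines of length 1..max_len, repetition allowed."""
    return sum(num_preprocessors ** n for n in range(1, max_len + 1))

def apply_pipeline(pipeline, columns):
    """Apply each named preprocessor, in order, to every column."""
    for name in pipeline:
        columns = [PREPROCESSORS[name](col) for col in columns]
    return columns

def loo_1nn_accuracy(columns, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier (a stand-in
    for the downstream model; it is sensitive to feature scale)."""
    rows = list(zip(*columns))
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, rows[j])),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def random_search(columns, labels, max_len=4, budget=30, seed=0):
    """Sample `budget` pipelines at random and keep the best-scoring one.
    The length is drawn uniformly first, so sampling is not uniform over
    the whole pipeline space."""
    rng = random.Random(seed)
    names = sorted(PREPROCESSORS)
    best_pipeline, best_score = None, -1.0
    for _ in range(budget):
        length = rng.randint(1, max_len)
        pipeline = tuple(rng.choice(names) for _ in range(length))
        score = loo_1nn_accuracy(apply_pipeline(pipeline, columns), labels)
        if score > best_score:
            best_pipeline, best_score = pipeline, score
    return best_pipeline, best_score

def make_toy_data(n=40, seed=1):
    """Two features: one informative, one large-scale noise column."""
    rng = random.Random(seed)
    informative, noise, labels = [], [], []
    for i in range(n):
        y = i % 2
        informative.append(y + rng.gauss(0, 0.2))
        noise.append(rng.gauss(0, 1000))
        labels.append(y)
    return [informative, noise], labels
```

Under these assumptions, `search_space_size(7, 7)` evaluates to 960799, which illustrates why exhaustive evaluation becomes costly when every candidate requires training a downstream model. Calling `random_search(*make_toy_data())` returns the best sampled pipeline and its score for the fixed seed. An evolution-based variant would replace the independent sampling step with mutation of the best pipelines found so far.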