Classical machine learning models, such as linear models and tree-based models, are widely used in industry. Because these models are sensitive to the data distribution, feature preprocessing, which transforms features from one distribution to another, is a crucial step in ensuring good model quality. Manually constructing a feature preprocessing pipeline is challenging because data scientists must decide which preprocessors to select and in which order to compose them. In this paper, we study how to automate feature preprocessing (Auto-FP) for tabular data. Because the search space is large, a brute-force solution is prohibitively expensive. To address this challenge, we observe that Auto-FP can be modeled as either a hyperparameter optimization (HPO) problem or a neural architecture search (NAS) problem. This observation enables us to extend a variety of HPO and NAS algorithms to solve the Auto-FP problem. We conduct a comprehensive evaluation and analysis of 15 algorithms on 45 public ML datasets. Overall, evolution-based algorithms achieve the best average ranking. Surprisingly, random search turns out to be a strong baseline. Many surrogate-model-based and bandit-based search algorithms, which perform well for HPO and NAS, do not outperform random search for Auto-FP. We analyze the reasons for these findings and conduct a bottleneck analysis to identify opportunities to improve these algorithms. Furthermore, we explore how to extend Auto-FP to support parameter search and compare two ways of achieving this goal. Finally, we evaluate Auto-FP in an AutoML context and discuss the limitations of popular AutoML tools. To the best of our knowledge, this is the first study on automated feature preprocessing. We hope our work can inspire researchers to develop new algorithms tailored for Auto-FP.
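To make the search formulation concrete, the following is a minimal sketch of random search over preprocessing pipelines, the baseline the abstract describes as surprisingly strong. It is not the authors' implementation. Everything in it is an assumption made for illustration: the three preprocessors (min_max, standardize, signed_log), the nearest-centroid model, the synthetic data from make_data, and all function and parameter names. It uses only the Python standard library.

```python
import math
import random


def min_max(rows):
    """Rescale each column to [0, 1]."""
    cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # avoid division by zero on constant columns
        cols.append([(v - lo) / span for v in col])
    return [list(r) for r in zip(*cols)]


def standardize(rows):
    """Shift each column to zero mean and scale it to unit variance."""
    cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col)) or 1.0
        cols.append([(v - mean) / std for v in col])
    return [list(r) for r in zip(*cols)]


def signed_log(rows):
    """Compress large magnitudes while keeping the sign of each value."""
    return [[math.copysign(math.log1p(abs(v)), v) for v in row] for row in rows]


# Hypothetical preprocessor set; a pipeline is an ordered tuple of these names.
PREPROCESSORS = {"min_max": min_max, "standardize": standardize,
                 "signed_log": signed_log}


def evaluate(pipeline, X, y, split):
    """Apply the pipeline, then score a nearest-centroid model on held-out rows.

    Simplification: column statistics are computed on all rows at once.
    A real evaluation would fit the preprocessors on training rows only.
    """
    for name in pipeline:
        X = PREPROCESSORS[name](X)
    X_train, y_train = X[:split], y[:split]
    X_valid, y_valid = X[split:], y[split:]
    centroids = {}
    for label in set(y_train):
        members = [row for row, lab in zip(X_train, y_train) if lab == label]
        centroids[label] = [sum(col) / len(col) for col in zip(*members)]
    correct = 0
    for row, label in zip(X_valid, y_valid):
        pred = min(centroids, key=lambda c: math.dist(row, centroids[c]))
        correct += pred == label
    return correct / len(X_valid)


def random_search(X, y, split, n_trials=30, max_len=4, seed=0):
    """Sample random pipelines and keep the best; starts from the empty pipeline."""
    rng = random.Random(seed)
    names = sorted(PREPROCESSORS)
    best_pipeline, best_score = (), evaluate((), X, y, split)
    for _ in range(n_trials):
        length = rng.randint(1, max_len)
        pipeline = tuple(rng.choice(names) for _ in range(length))
        score = evaluate(pipeline, X, y, split)
        if score > best_score:
            best_pipeline, best_score = pipeline, score
    return best_pipeline, best_score


def make_data(n=200, seed=0):
    """Synthetic two-class data: one small informative feature, one large noisy one."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(label * 2.0, 1.0), rng.gauss(0.0, 1000.0)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    print("no preprocessing:", evaluate((), X, y, split=150))
    print("random search:   ", random_search(X, y, split=150))
```

With k preprocessors and pipelines of up to L steps, the number of candidates is k + k^2 + ... + k^L, which is why exhaustive enumeration becomes impractical as either number grows. The evolution-based, surrogate-model-based, and bandit-based algorithms mentioned in the abstract would replace only the sampling loop in random_search; the evaluate step would stay the same.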