Auto-FP: An Experimental Study of Automated Feature Preprocessing for Tabular Data

Classical machine learning models, such as linear models and tree-based models, are widely used in industry. These models are sensitive to the data distribution, so feature preprocessing, which transforms features from one distribution to another, is a crucial step for ensuring good model quality. Manually constructing a feature preprocessing pipeline is challenging because data scientists must decide which preprocessors to select and in which order to compose them. In this paper, we study how to automate feature preprocessing (Auto-FP) for tabular data. Because the search space is large, a brute-force solution is prohibitively expensive. To address this challenge, we observe that Auto-FP can be modelled as either a hyperparameter optimization (HPO) problem or a neural architecture search (NAS) problem. This observation enables us to extend a variety of HPO and NAS algorithms to solve the Auto-FP problem. We conduct a comprehensive evaluation and analysis of 15 algorithms on 45 public ML datasets. Overall, evolution-based algorithms achieve the best average ranking. Surprisingly, random search turns out to be a strong baseline. Many surrogate-model-based and bandit-based search algorithms, which perform well for HPO and NAS, do not outperform random search for Auto-FP. We analyze the reasons for these findings and conduct a bottleneck analysis to identify opportunities to improve these algorithms. Furthermore, we explore how to extend Auto-FP to support parameter search and compare two ways of achieving this goal. Finally, we evaluate Auto-FP in an AutoML context and discuss the limitations of popular AutoML tools. To the best of our knowledge, this is the first study on automated feature preprocessing. We hope our work can inspire researchers to develop new algorithms tailored for Auto-FP.
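To make the search problem concrete, the following is a minimal sketch of random search over ordered preprocessor pipelines, the baseline the abstract reports as surprisingly strong. It is an illustration under stated assumptions, not the paper's implementation: the candidate list `PREPROCESSORS`, the maximum pipeline length, the sampling scheme, and the `toy_score` evaluator are all hypothetical. A real evaluator would fit a downstream model on the preprocessed data and return a validation score.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical candidate set; the abstract does not list the preprocessors.
PREPROCESSORS = ("StandardScaler", "MinMaxScaler", "Normalizer", "PowerTransformer")

Pipeline = Tuple[str, ...]


def search_space_size(n_preprocessors: int, max_length: int) -> int:
    """Count ordered pipelines of length 1..max_length, repeats allowed."""
    return sum(n_preprocessors ** length for length in range(1, max_length + 1))


def sample_pipeline(
    rng: random.Random, candidates: Sequence[str], max_length: int
) -> Pipeline:
    """Draw a length uniformly, then draw each step uniformly with replacement.

    Note: this is one simple sampling scheme (an assumption); it is not
    uniform over all pipelines, because short pipelines are favoured.
    """
    length = rng.randint(1, max_length)
    return tuple(rng.choice(candidates) for _ in range(length))


def random_search(
    evaluate: Callable[[Pipeline], float],
    candidates: Sequence[str],
    max_length: int,
    n_trials: int,
    seed: int = 0,
) -> Tuple[Optional[Pipeline], float]:
    """Evaluate n_trials random pipelines and return the best one found."""
    rng = random.Random(seed)
    best_pipeline: Optional[Pipeline] = None
    best_score = float("-inf")
    for _ in range(n_trials):
        pipeline = sample_pipeline(rng, candidates, max_length)
        score = evaluate(pipeline)
        if score > best_score:
            best_pipeline, best_score = pipeline, score
    return best_pipeline, best_score


def toy_score(pipeline: Pipeline) -> float:
    """Stand-in for validation accuracy (hypothetical, for demonstration only).

    Rewards pipelines that start with StandardScaler and penalises length.
    """
    bonus = 1.0 if pipeline[0] == "StandardScaler" else 0.0
    return bonus - 0.1 * len(pipeline)


if __name__ == "__main__":
    print("pipelines in search space:", search_space_size(len(PREPROCESSORS), 3))
    best, score = random_search(toy_score, PREPROCESSORS, max_length=3, n_trials=50)
    print("best pipeline:", best, "score:", score)
```

The count returned by `search_space_size` grows geometrically with the maximum length, which is why exhaustive enumeration becomes impractical once each evaluation involves training a model.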

