Data preprocessing is a crucial step in the machine learning workflow that transforms raw data into a format more usable by downstream ML models. However, it can be costly and time-consuming, and it often requires domain expertise. Existing automated machine learning (AutoML) frameworks claim to automate data preprocessing, but they typically search a restricted space of preprocessing pipelines, which limits the potential performance gains, and they are often too slow because they require training the ML model multiple times. In this paper, we propose DiffPrep, a method that automatically and efficiently searches for a data preprocessing pipeline for a given tabular dataset and a differentiable ML model such that the performance of the ML model is maximized. We formalize the problem of data preprocessing pipeline search as a bi-level optimization problem. To solve this problem efficiently, we relax the discrete, non-differentiable search space into a continuous and differentiable one, which allows us to perform the pipeline search using gradient descent while training the ML model only once. Our experiments show that DiffPrep achieves the best test accuracy on 15 of the 18 real-world datasets evaluated and improves the model's test accuracy by up to 6.6 percentage points.
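To make the relaxation idea concrete, the following is a minimal sketch under stated assumptions, not the DiffPrep implementation. It assumes a toy search space of three hypothetical operators for a single numeric feature (`OPS`), a linear model with squared-error loss, and alternating first-order updates as a simple stand-in for solving the bi-level problem. The discrete choice of operator is replaced by a softmax-weighted mixture controlled by `alpha`; the model parameters are updated on the training loss and `alpha` on the validation loss inside one training loop. All names, learning rates, and data here are illustrative.

```python
import math

# Hypothetical candidate operators for one numeric feature. These are
# stand-ins for a real preprocessing search space, not the paper's operators.
OPS = [
    lambda x: x,              # identity
    lambda x: math.log1p(x),  # log transform
    lambda x: math.sqrt(x),   # square-root transform
]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relaxed_transform(x, alpha):
    """Continuous relaxation: softmax-weighted mixture of all operator outputs."""
    probs = softmax(alpha)
    return sum(p * op(x) for p, op in zip(probs, OPS))

def loss(alpha, w, b, xs, ys):
    """Mean squared error of the linear model w * z + b on relaxed features z."""
    return sum((w * relaxed_transform(x, alpha) + b - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def model_grad(alpha, w, b, xs, ys):
    """Gradient of the loss with respect to the model parameters (w, b)."""
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        z = relaxed_transform(x, alpha)
        r = 2.0 * (w * z + b - y) / len(xs)
        gw += r * z
        gb += r
    return gw, gb

def alpha_grad(alpha, w, b, xs, ys):
    """Gradient of the loss with respect to the pipeline parameters alpha."""
    probs = softmax(alpha)
    grad = [0.0] * len(alpha)
    for x, y in zip(xs, ys):
        outs = [op(x) for op in OPS]
        z = sum(p * o for p, o in zip(probs, outs))
        r = 2.0 * (w * z + b - y) / len(xs)
        for k in range(len(alpha)):
            # For a softmax mixture, dz/dalpha_k = p_k * (f_k(x) - z).
            grad[k] += r * w * probs[k] * (outs[k] - z)
    return grad

def search(train, val, steps=2000, lr_model=0.01, lr_alpha=0.05):
    """Single training run: alternate model and pipeline updates (assumed scheme)."""
    alpha, w, b = [0.0] * len(OPS), 0.0, 0.0
    for _ in range(steps):
        gw, gb = model_grad(alpha, w, b, *train)   # inner level: training loss
        w, b = w - lr_model * gw, b - lr_model * gb
        ga = alpha_grad(alpha, w, b, *val)         # outer level: validation loss
        alpha = [a - lr_alpha * g for a, g in zip(alpha, ga)]
    return alpha, w, b

# Synthetic data, made up for illustration only.
xs = [0.5 * i for i in range(20)]
ys = [2.0 * math.log1p(x) + 1.0 for x in xs]
train = (xs[0::2], ys[0::2])
val = (xs[1::2], ys[1::2])

alpha, w, b = search(train, val)
# Discretize the relaxed pipeline by keeping the operator with the largest weight.
best = max(range(len(OPS)), key=lambda k: alpha[k])
```

After the loop, `softmax(alpha)` gives the learned mixture weights and `best` indexes the operator that a discretized pipeline would keep. The key design point is that no model is retrained per candidate pipeline: both sets of parameters are updated by gradient steps within the same run.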