DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data

Data preprocessing is a crucial step in the machine learning process that transforms raw data into a more usable format for downstream ML models. However, it can be costly and time-consuming, often requiring the expertise of domain experts. Existing automated machine learning (AutoML) frameworks claim to automate data preprocessing. However, they often use a restricted search space of data preprocessing pipelines, which limits the potential performance gains, and they are often too slow because they require training the ML model multiple times. In this paper, we propose DiffPrep, a method that can automatically and efficiently search for a data preprocessing pipeline for a given tabular dataset and a differentiable ML model such that the performance of the ML model is maximized. We formalize the problem of data preprocessing pipeline search as a bi-level optimization problem. To solve this problem efficiently, we transform and relax the discrete, non-differentiable search space into a continuous and differentiable one, which allows us to perform the pipeline search using gradient descent while training the ML model only once. Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated and improves the model's test accuracy by up to 6.6 percentage points.
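
To make the relaxation idea concrete, here is a minimal Python sketch of one way a discrete choice between preprocessing operators can be turned into a continuous one. It is an illustration under stated assumptions, not the paper's implementation: the candidate set (`identity`, `min_max`, `standardize`), the logit vector `alpha`, the `toy_loss` target, and the finite-difference gradient are all hypothetical stand-ins. In a real setup the loss would presumably be the model's validation loss and the gradient would come from automatic differentiation.

```python
import math

def softmax(logits):
    """Turn unconstrained logits into weights that are positive and sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def identity(col):
    return list(col)

def min_max(col):
    lo, hi = min(col), max(col)
    span = (hi - lo) or 1.0  # guard against a constant column
    return [(v - lo) / span for v in col]

def standardize(col):
    mean = sum(col) / len(col)
    std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col)) or 1.0
    return [(v - mean) / std for v in col]

# Hypothetical candidate set for one pipeline step (not the paper's search space).
CANDIDATES = [identity, min_max, standardize]

def relaxed_transform(col, alpha):
    """Continuous relaxation: a softmax-weighted mix of every candidate's output."""
    weights = softmax(alpha)
    outputs = [op(col) for op in CANDIDATES]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(col))]

def toy_loss(col, alpha, target):
    """Mean squared error against a fixed target; a stand-in for a validation loss."""
    mixed = relaxed_transform(col, alpha)
    return sum((m - t) ** 2 for m, t in zip(mixed, target)) / len(col)

def numeric_grad(col, alpha, target, eps=1e-5):
    """Central finite differences, used here in place of automatic differentiation."""
    grad = []
    for j in range(len(alpha)):
        up = alpha[:j] + [alpha[j] + eps] + alpha[j + 1:]
        down = alpha[:j] + [alpha[j] - eps] + alpha[j + 1:]
        grad.append((toy_loss(col, up, target) - toy_loss(col, down, target)) / (2 * eps))
    return grad

column = [1.0, 2.0, 3.0, 10.0]
alpha = [0.0, 0.0, 0.0]          # equal logits: each candidate gets weight 1/3
target = min_max(column)         # toy target that favours min-max scaling
grad = numeric_grad(column, alpha, target)
alpha_next = [a - 0.1 * g for a, g in zip(alpha, grad)]  # one gradient descent step
```

Because the mixed output depends smoothly on `alpha`, the loss has a gradient with respect to the operator choice. In this toy setting, the single step lowers the logit of `identity` and raises that of `min_max`. Once the search is finished, a natural way to recover a discrete pipeline would be to keep the candidate with the largest weight at each step.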
