Automated data processing and feature engineering for deep learning and big data applications: a survey

The modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data. This approach has achieved impressive results and has contributed significantly to the progress of AI, particularly in supervised deep learning. It has also simplified the design of machine learning systems, because the learning process is highly automated. However, not all data processing tasks in conventional deep learning pipelines have been automated. In most cases, data must be manually collected, preprocessed, and further extended through data augmentation before it can be used effectively for training. Recently, special techniques for automating these tasks have emerged. This automation is driven by the need to utilize large volumes of complex, heterogeneous data for machine learning and big data applications. Today, end-to-end automated data processing systems based on automated machine learning (AutoML) techniques can take raw data and transform it into useful features for big data tasks by automating all intermediate processing stages. In this work, we present a thorough review of approaches for automating data processing tasks in deep learning pipelines. The review covers automated data preprocessing (e.g., data cleaning, labeling, missing data imputation, and categorical data encoding), data augmentation (including synthetic data generation using generative AI methods), and feature engineering (specifically, automated feature extraction, feature construction, and feature selection). In addition to automating specific data processing tasks, we discuss the use of AutoML methods and tools to simultaneously optimize all stages of the machine learning pipeline.
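To make two of the preprocessing tasks named above concrete (missing data imputation and categorical data encoding), the following is a minimal sketch in plain Python. It is not taken from the survey: the function name `auto_preprocess`, the rule for inferring column types, and the choice of mean/mode imputation with one-hot encoding are all illustrative assumptions, and the systems the survey reviews may use quite different strategies.

```python
from statistics import mean


def auto_preprocess(rows):
    """Hypothetical helper: impute missing values and one-hot encode categoricals.

    rows: list of dicts mapping column name -> value (None marks a missing value).
    Returns (feature_names, matrix), where matrix is a list of rows of floats.
    """
    columns = list(rows[0].keys())
    feature_names, feature_columns = [], []
    for col in columns:
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        if all(isinstance(v, (int, float)) for v in present):
            # Numeric column (assumed rule): fill missing entries with the mean.
            fill = mean(present) if present else 0.0
            feature_names.append(col)
            feature_columns.append([float(fill if v is None else v) for v in values])
        else:
            # Categorical column (assumed rule): fill with the most frequent
            # category, then emit one 0/1 indicator column per category.
            mode = max(set(present), key=present.count)
            filled = [mode if v is None else v for v in values]
            for category in sorted(set(filled)):
                feature_names.append(f"{col}={category}")
                feature_columns.append([1.0 if v == category else 0.0 for v in filled])
    # Transpose from column-major to row-major layout.
    matrix = [list(r) for r in zip(*feature_columns)]
    return feature_names, matrix


# Toy data, invented purely for illustration.
rows = [
    {"age": 30, "city": "Paris"},
    {"age": None, "city": "Lyon"},
    {"age": 50, "city": None},
    {"age": 40, "city": "Paris"},
]
names, matrix = auto_preprocess(rows)
print(names)
print(matrix)
```

The sketch infers each column's type from its values, which is the kind of decision an automated pipeline has to make without human input; a real AutoML system would typically also search over alternative imputation and encoding strategies rather than fix one.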

