The modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data. This approach has achieved impressive results and has contributed significantly to the progress of AI, particularly in supervised deep learning. It has also simplified the design of machine learning systems, since the learning process is highly automated. However, not all data processing tasks in conventional deep learning pipelines have been automated. In most cases, data must be manually collected, preprocessed, and further extended through data augmentation before it can be used effectively for training. Recently, specialized techniques for automating these tasks have emerged. The automation of data processing tasks is driven by the need to utilize large volumes of complex, heterogeneous data for machine learning and big data applications. Today, end-to-end automated data processing systems based on automated machine learning (AutoML) techniques are capable of taking raw data and transforming it into useful features for big data tasks by automating all intermediate processing stages. In this work, we present a thorough review of approaches for automating data processing tasks in deep learning pipelines, including automated data preprocessing (e.g., data cleaning, labeling, missing data imputation, and categorical data encoding), data augmentation (including synthetic data generation using generative AI methods), and feature engineering (specifically, automated feature extraction, feature construction, and feature selection). In addition to automating specific data processing tasks, we discuss the use of AutoML methods and tools to simultaneously optimize all stages of the machine learning pipeline.