Data augmentation is arguably the most important regularization technique commonly used to improve the generalization performance of machine learning models. It primarily involves applying appropriate data transformation operations to create new data samples with desired properties. Despite its effectiveness, the process is often challenging because manually creating and testing candidate augmentations and their hyperparameters requires time-consuming trial and error. Automated data augmentation methods aim to automate this process, and state-of-the-art approaches typically rely on automated machine learning (AutoML) principles. This work presents a comprehensive survey of AutoML-based data augmentation techniques. We discuss various approaches for accomplishing data augmentation with AutoML, including data manipulation, data integration, and data synthesis techniques. We present an extensive discussion of techniques for realizing each of the major subtasks of the data augmentation process: search space design, hyperparameter optimization, and model evaluation. Finally, we carry out an extensive comparison and analysis of the performance of automated data augmentation techniques and state-of-the-art methods based on classical augmentation approaches. The results show that AutoML methods for data augmentation currently outperform state-of-the-art techniques based on conventional approaches.