Data augmentation with automated machine learning: approaches and performance comparison with classical data augmentation methods

Data augmentation is arguably the most important regularization technique commonly used to improve the generalization performance of machine learning models. It primarily involves applying appropriate data transformation operations to create new data samples with desired properties. Despite its effectiveness, the process is often challenging because creating and testing different candidate augmentations and their hyperparameters manually requires time-consuming trial and error. Automated data augmentation methods aim to automate this process. State-of-the-art approaches typically rely on automated machine learning (AutoML) principles. This work presents a comprehensive survey of AutoML-based data augmentation techniques. We discuss various approaches for accomplishing data augmentation with AutoML, including data manipulation, data integration and data synthesis techniques. We present an extensive discussion of techniques for realizing each of the major subtasks of the data augmentation process: search space design, hyperparameter optimization and model evaluation. Finally, we carry out an extensive comparison and analysis of the performance of automated data augmentation techniques and state-of-the-art methods based on classical augmentation approaches. The results show that AutoML methods for data augmentation currently outperform state-of-the-art techniques based on conventional approaches.
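
The three subtasks named in the abstract (search space design, hyperparameter optimization and model evaluation) can be illustrated with a small sketch. It is not taken from the survey: the operation names, the magnitude grid, the toy two-class data, the nearest-centroid model and the use of plain random search are all assumptions chosen to keep the example short and dependency-free.

```python
import random

# Search space (hypothetical): each operation maps (sample, magnitude, rng) to a new sample.
OPS = {
    "jitter": lambda x, m, rng: [v + rng.gauss(0.0, m) for v in x],
    "scale": lambda x, m, rng: [v * (1.0 + rng.uniform(-m, m)) for v in x],
    "identity": lambda x, m, rng: list(x),
}
MAGNITUDES = [0.05, 0.1, 0.2, 0.4]


def make_data(n, rng):
    """Toy two-class data: 2-D points scattered around (0, 0) and (1, 1)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        point = [label + rng.gauss(0.0, 0.5), label + rng.gauss(0.0, 0.5)]
        data.append((point, label))
    return data


def fit_centroids(data):
    """Stand-in model: a nearest-centroid classifier (mean point per class)."""
    centroids = {}
    for label in {y for _, y in data}:
        pts = [x for x, y in data if y == label]
        centroids[label] = [sum(c) / len(pts) for c in zip(*pts)]
    return centroids


def accuracy(centroids, data):
    """Model evaluation: fraction of samples assigned to the correct class."""
    def predict(x):
        return min(
            centroids,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])),
        )
    return sum(predict(x) == y for x, y in data) / len(data)


def random_search(train, val, trials=20, seed=0):
    """Hyperparameter optimization by random search over (operation, magnitude)."""
    rng = random.Random(seed)
    best_policy, best_score = None, -1.0
    for _ in range(trials):
        op, mag = rng.choice(sorted(OPS)), rng.choice(MAGNITUDES)
        # Augment: keep the original samples and add one transformed copy of each.
        augmented = train + [(OPS[op](x, mag, rng), y) for x, y in train]
        score = accuracy(fit_centroids(augmented), val)
        if score > best_score:
            best_policy, best_score = (op, mag), score
    return best_policy, best_score


data_rng = random.Random(42)
train, val = make_data(40, data_rng), make_data(200, data_rng)
policy, score = random_search(train, val)
print("best policy:", policy, "validation accuracy:", score)
```

Random search is used here only because it is the simplest optimizer to show. The same loop structure would hold if the policy sampler were replaced by a more sophisticated search strategy, with the validation score serving as the feedback signal.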
