A novel algorithm can generate data to train machine learning models in conditions of extreme scarcity of real world data

Training machine learning models requires large datasets. However, collecting, curating, and operating large and complex sets of real-world data raises problems of cost, ethical and legal constraints, and data availability. Here we propose a novel algorithm that generates large artificial datasets for training machine learning models in conditions of extreme scarcity of real-world data. The algorithm is based on a genetic algorithm that mutates randomly generated datasets, each of which is then used to train a neural network. After training, the performance of the neural network on a batch of real-world data is taken as a surrogate for the fitness of the generated dataset used to train it. As selection pressure is applied to the population of generated datasets, unfit individuals are discarded, and the fitness of the fittest individuals increases over generations. The performance of the data generation algorithm was measured on the Iris dataset and on the Breast Cancer Wisconsin diagnostic dataset. When real-world data were abundant, the mean accuracy of machine learning models trained on generated data was comparable to that of models trained on real-world data (0.956 in both cases on the Iris dataset, p = 0.6996, and 0.9377 versus 0.9472 on the Breast Cancer dataset, p = 0.1189). Under simulated extreme scarcity of real-world data, the mean accuracy of models trained on generated data was significantly higher than that of comparable models trained on the scarce real-world data (0.9533 versus 0.9067 on the Iris dataset, p < 0.0001, and 0.8692 versus 0.7701 on the Breast Cancer dataset, p = 0.0091). In conclusion, this novel algorithm can generate large artificial datasets for training machine learning models when real-world data are extremely scarce, or when cost or data sensitivity prevents the collection of large real-world datasets.
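The evolutionary loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. To keep it dependency-free, it swaps the neural network for a nearest-centroid classifier, and the names (`evolve`, `mutate`, `fitness`, `REAL_BATCH`), the toy two-class data, the population size, the mutation rate, and the keep-the-top-half selection rule are all assumptions made for illustration. The core idea is preserved: each individual is a whole generated dataset, and its fitness is the accuracy, on a small batch of real data, of a model trained only on that generated dataset.

```python
import random

# Hypothetical stand-in for the scarce batch of real-world data: (features, label).
REAL_BATCH = [([1.0, 1.2], 0), ([1.5, 0.8], 0), ([0.9, 1.9], 0),
              ([8.0, 8.5], 1), ([7.5, 9.0], 1), ([8.8, 7.9], 1)]

def train_centroids(dataset):
    """Fit a nearest-centroid classifier (assumed stand-in for the neural network)."""
    sums, counts = {}, {}
    for features, label in dataset:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest to the given point."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def fitness(dataset, real_batch):
    """Surrogate fitness: accuracy on real data of a model trained on generated data."""
    centroids = train_centroids(dataset)
    hits = sum(predict(centroids, x) == y for x, y in real_batch)
    return hits / len(real_batch)

def random_dataset(rng, n_rows, n_features, labels, low, high):
    """Create one individual: a dataset of uniformly random points and labels."""
    return [([rng.uniform(low, high) for _ in range(n_features)], rng.choice(labels))
            for _ in range(n_rows)]

def mutate(dataset, rng, rate=0.2, scale=0.5):
    """Copy a dataset, adding Gaussian noise to a random subset of its rows."""
    child = []
    for features, label in dataset:
        if rng.random() < rate:
            features = [v + rng.gauss(0.0, scale) for v in features]
        child.append((list(features), label))
    return child

def evolve(real_batch, rng, pop_size=20, n_rows=30, generations=40):
    """Evolve a population of generated datasets under selection pressure."""
    n_features = len(real_batch[0][0])
    labels = sorted({y for _, y in real_batch})
    population = [random_dataset(rng, n_rows, n_features, labels, 0.0, 10.0)
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda d: fitness(d, real_batch), reverse=True)
        history.append(fitness(ranked[0], real_batch))
        survivors = ranked[: pop_size // 2]  # discard the unfit half
        offspring = [mutate(rng.choice(survivors), rng)
                     for _ in range(pop_size - len(survivors))]
        population = survivors + offspring
    best = max(population, key=lambda d: fitness(d, real_batch))
    return best, history

if __name__ == "__main__":
    best, history = evolve(REAL_BATCH, random.Random(0))
    print("best fitness per generation:", history)
```

Because the surviving half is carried over unchanged, the best fitness in this sketch can never decrease from one generation to the next. In a real setting the fitness batch should be kept separate from the data used for the final evaluation, since the generated datasets are selected directly against it.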

