Training machine learning models requires large datasets. However, collecting, curating, and managing large and complex sets of real world data raises problems of cost, ethical and legal constraints, and data availability. Here we propose a novel algorithm that generates large artificial datasets for training machine learning models when real world data is extremely scarce. The method is a genetic algorithm that mutates randomly generated datasets, each of which is then used to train a neural network. After training, the performance of the neural network on a batch of real world data serves as a surrogate for the fitness of the generated dataset it was trained on. As selection pressure is applied to the population of generated datasets, unfit individuals are discarded, and the fitness of the fittest individuals increases over generations. The performance of the data generation algorithm was measured on the Iris dataset and on the Breast Cancer Wisconsin diagnostic dataset. When real world data was abundant, the mean accuracy of models trained on generated data was comparable to that of models trained on real world data (0.956 in both cases on the Iris dataset, p = 0.6996, and 0.9377 versus 0.9472 on the Breast Cancer dataset, p = 0.1189). Under simulated extreme scarcity of real world data, the mean accuracy of models trained on generated data was significantly higher than that of comparable models trained on the scarce real world data (0.9533 versus 0.9067 on the Iris dataset, p < 0.0001, and 0.8692 versus 0.7701 on the Breast Cancer dataset, p = 0.0091). In conclusion, this novel algorithm can generate large artificial datasets for training machine learning models when real world data is extremely scarce, or when cost or data sensitivity prevents the collection of large real world datasets.
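The generate, train, score, and select loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the mutation operator, the selection scheme, or the population size, so everything below is an assumption made for illustration. These assumptions are Gaussian feature mutation, truncation selection that keeps the top half of the population, and every function name and parameter value. To keep the sketch dependency-free, a nearest-centroid classifier stands in for the neural network, and the "real world" batch is a made-up toy set of four points rather than Iris or Breast Cancer data.

```python
import random

def random_dataset(n_rows, n_features, n_classes, rng):
    """One individual: rows of (features, label) drawn uniformly at random."""
    return [([rng.random() for _ in range(n_features)], rng.randrange(n_classes))
            for _ in range(n_rows)]

def mutate(dataset, rate, scale, rng):
    """Return a copy with each feature perturbed with probability `rate` (assumed operator)."""
    return [([x + rng.gauss(0.0, scale) if rng.random() < rate else x for x in feats], label)
            for feats, label in dataset]

def train_centroids(dataset, n_classes):
    """Stand-in model (not a neural network): per-class mean of the features."""
    n_features = len(dataset[0][0])
    sums = [[0.0] * n_features for _ in range(n_classes)]
    counts = [0] * n_classes
    for feats, label in dataset:
        counts[label] += 1
        for j, x in enumerate(feats):
            sums[label][j] += x
    # A class with no generated rows has no centroid and is never predicted.
    return [[s / counts[c] for s in sums[c]] if counts[c] else None
            for c in range(n_classes)]

def predict(centroids, feats):
    """Label of the nearest available centroid (squared Euclidean distance)."""
    best_label, best_dist = None, float("inf")
    for label, centroid in enumerate(centroids):
        if centroid is None:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(feats, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def fitness(dataset, real_batch, n_classes):
    """Surrogate fitness: accuracy on real data of a model trained on `dataset`."""
    centroids = train_centroids(dataset, n_classes)
    hits = sum(predict(centroids, feats) == label for feats, label in real_batch)
    return hits / len(real_batch)

def evolve(real_batch, n_classes, n_features, pop_size=20, n_rows=30,
           generations=40, rate=0.2, scale=0.1, seed=0):
    """Evolve generated datasets; return the fittest one and the per-generation best fitness."""
    rng = random.Random(seed)
    population = [random_dataset(n_rows, n_features, n_classes, rng)
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda d: fitness(d, real_batch, n_classes),
                        reverse=True)
        history.append(fitness(ranked[0], real_batch, n_classes))
        survivors = ranked[:pop_size // 2]          # unfit individuals are discarded
        children = [mutate(rng.choice(survivors), rate, scale, rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children           # survivors are kept unmutated
    best = max(population, key=lambda d: fitness(d, real_batch, n_classes))
    return best, history

# Hypothetical scarce "real" batch: two well-separated classes, two points each.
toy_batch = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.8, 0.9], 1), ([0.9, 0.8], 1)]
best, history = evolve(toy_batch, n_classes=2, n_features=2)
print("best fitness per generation:", history)
```

Because the survivors are carried over unchanged and the fitness function is deterministic, the best fitness in this sketch cannot decrease from one generation to the next, which mirrors the abstract's statement that the fitness of the fittest individuals increases over generations. Scoring every candidate on the same small real batch means the reported fitness is measured on the data that drives selection, so a separate held-out set would be needed for any accuracy estimate.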