Data generation techniques are a growing trend in machine learning, since model performance depends on the size of the training dataset. In many medical applications, however, collecting large datasets is difficult because of resource constraints, and models trained on the resulting small datasets tend to overfit and generalize poorly. This paper introduces Artificial Data Point Generation in Clustered Latent Space (AGCL), a novel method designed to improve classification performance on small medical datasets through synthetic data generation. The AGCL framework consists of four steps: feature extraction, K-means clustering, cluster evaluation based on a class separation metric, and generation of synthetic data points from clusters with distinct class representations. The method was applied to Parkinson's disease screening from facial expression data and evaluated across multiple machine learning classifiers. Experimental results show that AGCL significantly improves classification accuracy over the baseline, GN, and kNNMTD. With majority voting over different emotions, AGCL achieved the highest overall test accuracy of 83.33% and a cross-validation accuracy of 90.90%, confirming its effectiveness in augmenting small datasets.
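The four AGCL steps named in the abstract (feature extraction, K-means clustering, cluster evaluation, synthetic point generation) can be illustrated with a minimal sketch. The abstract does not specify the class separation metric or the generation rule, so the choices here are assumptions made only for illustration: cluster purity (the fraction of the majority class) stands in for the separation metric, and new points are drawn by linear interpolation between two same-class points of a sufficiently pure cluster. The names `kmeans` and `agcl_like_augment`, and the parameters `purity_threshold` and `per_cluster`, are hypothetical. Features are assumed to be already extracted, and K-means is written out by hand with the standard library only to keep the example self-contained.

```python
import random
from collections import Counter


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns a cluster index for each point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def agcl_like_augment(features, classes, k=2, purity_threshold=0.9,
                      per_cluster=5, seed=0):
    """Return synthetic (vector, class) pairs drawn from class-pure clusters.

    ASSUMPTION: purity and interpolation are illustrative stand-ins; the
    abstract does not define the actual metric or generation rule.
    """
    rng = random.Random(seed)
    cluster_ids = kmeans(features, k, seed=seed)
    synthetic = []
    for c in range(k):
        idx = [i for i, cid in enumerate(cluster_ids) if cid == c]
        if not idx:
            continue
        majority, count = Counter(classes[i] for i in idx).most_common(1)[0]
        purity = count / len(idx)  # assumed class separation score
        pure = [features[i] for i in idx if classes[i] == majority]
        if purity < purity_threshold or len(pure) < 2:
            continue  # skip mixed clusters and clusters too small to interpolate
        for _ in range(per_cluster):
            a, b = rng.sample(pure, 2)
            t = rng.random()
            # New point on the segment between two same-class cluster members.
            synthetic.append(([x + t * (y - x) for x, y in zip(a, b)], majority))
    return synthetic


if __name__ == "__main__":
    feats = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
    labels = [0, 0, 0, 1, 1, 1]
    for vec, lab in agcl_like_augment(feats, labels, k=2, per_cluster=2):
        print(lab, [round(v, 3) for v in vec])
```

In this sketch, only clusters dominated by a single class contribute synthetic points, so each generated point inherits an unambiguous label. The synthetic pairs would be appended to the training set only, never to the test or validation folds.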