A Novel Machine Learning Classifier Based on Genetic Algorithms and Data Importance Reformatting

This paper proposes GADIC, a novel classification algorithm based on Data Importance (DI) reformatting and Genetic Algorithms (GA), to overcome issues related to the nature of the data that may hinder the performance of Machine Learning (ML) classifiers. GADIC comprises three phases: a data reformatting phase, which depends on the DI concept; a training phase, in which GA is applied to the reformatted training dataset; and a testing phase, in which each instance of the reformatted testing dataset is averaged based on similar instances in the training dataset. GADIC thus builds on existing ML classifiers by reformatting the data, using GA to tune the inputs, and averaging the training instances that are similar to the unknown instance. In the testing phase, the resulting average takes the place of the unknown instance to be classified. GADIC has been tested on five existing ML classifiers: Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Logistic Regression (LR), Decision Tree (DT), and Naïve Bayes (NB). All were evaluated on seven open-source datasets from the UCI ML repository and Kaggle: Cleveland heart disease, Indian liver patient, Pima Indian diabetes, employee future prediction, telecom churn prediction, bank customer churn, and tech students. In terms of accuracy, the results showed that, with the exception of a decrease of approximately 1% in the accuracy of the NB classifier on the Cleveland heart disease dataset, GADIC significantly enhanced the performance of most ML classifiers across the various datasets. Among the classifiers combined with GADIC, KNN showed the greatest performance gain, followed by SVM, while LR showed the smallest improvement. The lowest average improvement that GADIC achieved was 5.96%, whereas the maximum average improvement reached 16.79%.
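The testing-phase idea described above, replacing an unknown instance with the average of similar training instances before classifying it, can be illustrated with a minimal sketch. The abstract does not specify how similarity is measured, how many instances are averaged, or how the base classifier is wired in, so everything below is an assumption made for illustration: Euclidean distance as the similarity measure, a hypothetical parameter `k` for the number of similar instances, and a simple nearest-centroid rule standing in for the base classifier (SVM, KNN, LR, DT, or NB in the paper). The DI reformatting and GA tuning phases are omitted entirely.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_similar_instances(unknown, train_X, k=3):
    """Feature-wise mean of the k training rows closest to `unknown`.

    Assumption: "similar" means nearest in Euclidean distance; the
    abstract does not define the similarity measure or the value of k.
    """
    nearest = sorted(train_X, key=lambda row: euclidean(row, unknown))[:k]
    return [sum(col) / len(nearest) for col in zip(*nearest)]


def nearest_centroid_predict(x, train_X, train_y):
    """Stand-in base classifier: label of the closest class centroid."""
    groups = {}
    for row, label in zip(train_X, train_y):
        groups.setdefault(label, []).append(row)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }
    return min(centroids, key=lambda label: euclidean(centroids[label], x))


def averaged_predict(unknown, train_X, train_y, k=3):
    """Classify the averaged instance instead of the raw unknown one."""
    averaged = average_similar_instances(unknown, train_X, k)
    return nearest_centroid_predict(averaged, train_X, train_y)


# Toy data (hypothetical, not from the paper's datasets).
train_X = [[0, 0], [0, 2], [2, 0], [10, 10], [10, 12], [12, 10]]
train_y = [0, 0, 0, 1, 1, 1]

print(averaged_predict([1, 1], train_X, train_y))
print(averaged_predict([9, 9], train_X, train_y))
```

In this sketch the averaging step acts as a smoothing filter: the classifier sees a point pulled toward the dense region of nearby training data rather than the raw test instance. Whether GADIC averages in exactly this way is not stated in the abstract.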
