We aim to demonstrate experimentally that our cost-sensitive PEGASOS SVM achieves good performance on imbalanced datasets with a majority-to-minority ratio ranging from 8.6:1 to 130:1, and to ascertain whether including an intercept (bias), regularization, and the parameter settings affect performance on our selection of datasets. Although many resort to SMOTE methods, we aim for a less computationally intensive approach. We evaluate performance by examining learning curves. These curves diagnose whether the model overfits or underfits, and whether the random sample of data drawn during training was insufficiently random or insufficiently diverse in the dependent-variable class for the algorithm to generalize to unseen examples. We also use validation curves to examine how the hyperparameters relate to training and test error. We benchmark the results of our PEGASOS cost-sensitive SVM against those of Ding's LINEAR SVM DECIDL method; Ding obtained an ROC-AUC of 0.5 on one dataset. Our work extends that of Ding by incorporating kernels into the SVM. We use Python rather than MATLAB because Python has dictionaries for storing mixed data types during multi-parameter cross-validation.
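As an illustration of the kind of method described above, the following is a minimal sketch of a Pegasos-style stochastic sub-gradient solver for a linear SVM with per-class misclassification costs and an optional intercept. It is not the authors' implementation: the function names, the `c_pos` and `c_neg` cost parameters, the handling of the bias as an appended constant feature, and the synthetic 10:1 data are all assumptions made for illustration, and kernels are omitted.

```python
import numpy as np


def cost_sensitive_pegasos(X, y, lam=0.01, n_iter=5000, c_pos=1.0, c_neg=1.0,
                           fit_intercept=True, seed=0):
    """Pegasos-style linear SVM with class-dependent hinge-loss costs.

    y must hold labels in {-1, +1}. c_pos and c_neg (hypothetical names)
    scale the hinge loss of positive and negative examples respectively.
    """
    rng = np.random.default_rng(seed)
    if fit_intercept:
        # Assumption: the bias is modelled as an extra constant feature,
        # so it is regularized together with the other weights.
        X = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(X.shape[1])
    for t in range(1, n_iter + 1):
        i = rng.integers(X.shape[0])       # draw one training example
        eta = 1.0 / (lam * t)              # standard Pegasos step size
        cost = c_pos if y[i] > 0 else c_neg
        margin = y[i] * (X[i] @ w)         # margin under the current weights
        w *= 1.0 - eta * lam               # shrinkage from the L2 regularizer
        if margin < 1.0:                   # hinge loss is active
            w += eta * cost * y[i] * X[i]
    return w


def predict(w, X, fit_intercept=True):
    """Return labels in {-1, +1} for the rows of X."""
    if fit_intercept:
        X = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.where(X @ w >= 0.0, 1, -1)


# Synthetic data with a 10:1 majority-to-minority ratio (illustrative only).
data_rng = np.random.default_rng(42)
X_neg = data_rng.normal(loc=-1.0, size=(500, 2))
X_pos = data_rng.normal(loc=1.0, size=(50, 2))
X = np.vstack([X_neg, X_pos])
y = np.concatenate([-np.ones(500), np.ones(50)])

# Weight minority errors by the class ratio, one common heuristic.
w = cost_sensitive_pegasos(X, y, c_pos=10.0, c_neg=1.0)
y_hat = predict(w, X)
minority_recall = np.mean(y_hat[y == 1] == 1)
```

Setting `c_pos` to the majority-to-minority ratio is only one possible choice; in practice the costs, `lam`, and `fit_intercept` would be selected by cross-validation, which is where a dictionary of parameter settings becomes convenient.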