We aim to demonstrate experimentally that our cost-sensitive PEGASOS SVM achieves good performance on imbalanced datasets with majority-to-minority ratios ranging from 8.6:1 to 130:1, and to ascertain whether including an intercept (bias), regularization, and the choice of parameters affects performance on our selection of datasets. Although many studies resort to SMOTE-style oversampling, we aim for a less computationally intensive approach. We evaluate performance by examining learning curves. These curves diagnose whether the model overfits or underfits, or whether the random sample drawn during training was insufficiently random or insufficiently diverse in the class of the dependent variable for the algorithm to generalize to unseen examples. We also use validation curves to examine how the hyperparameters relate to training and test error. We benchmark the results of our cost-sensitive PEGASOS SVM against Ding's linear SVM DECIDL method, with which Ding obtained an ROC-AUC of 0.5 on one dataset. Our work extends that of Ding by incorporating kernels into the SVM. We use Python rather than MATLAB because Python has dictionaries for storing mixed data types during multi-parameter cross-validation.
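To make the idea of a cost-sensitive PEGASOS update concrete, the following is a minimal sketch of a linear variant, not the implementation evaluated in this work. It assumes labels in {-1, +1}, omits the intercept and the kernel extension, and scales the hinge-loss subgradient by a per-class cost. The function name `cost_sensitive_pegasos`, the parameters `cost_pos` and `cost_neg`, and the choice of setting the minority cost equal to the imbalance ratio are all illustrative assumptions.

```python
import numpy as np


def cost_sensitive_pegasos(X, y, lam=0.01, n_iters=2000,
                           cost_pos=1.0, cost_neg=1.0, seed=0):
    """Linear PEGASOS with class-dependent misclassification costs.

    Hypothetical sketch: y must contain labels in {-1, +1}; no intercept.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for t in range(1, n_iters + 1):
        i = rng.integers(n_samples)        # draw one training example
        eta = 1.0 / (lam * t)              # PEGASOS step size 1 / (lambda * t)
        margin = y[i] * (X[i] @ w)         # margin under the current weights
        w *= 1.0 - eta * lam               # shrinkage from the L2 regularizer
        if margin < 1.0:                   # hinge loss is active
            cost = cost_pos if y[i] > 0 else cost_neg
            w += eta * cost * y[i] * X[i]  # cost-weighted subgradient step
    return w


# Synthetic 10:1 imbalanced data (assumed for illustration only).
rng = np.random.default_rng(42)
X_pos = rng.normal(3.0, 0.5, size=(20, 2))     # minority class, label +1
X_neg = rng.normal(-3.0, 0.5, size=(200, 2))   # majority class, label -1
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(20), -np.ones(200)])

# Assumed heuristic: weight minority errors by the imbalance ratio.
w = cost_sensitive_pegasos(X, y, cost_pos=10.0, cost_neg=1.0)
predictions = np.sign(X @ w)
```

Weighting only the loss term leaves the regularization step unchanged, so the per-iteration cost matches that of standard PEGASOS and no synthetic samples need to be generated.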