Support Vector Machine (SVM) is a robust machine learning algorithm with broad applications in classification, regression, and outlier detection. SVM requires tuning of the regularization parameter (RP), which controls model capacity and generalization performance. Conventionally, the optimum RP is found by comparing a range of candidate values through Cross-Validation (CV). In addition, for data that are not linearly separable, SVM relies on kernels, and a set of candidate kernels, each with its own set of parameters (together denoted a grid of kernels), is considered. The optimal RP and kernel are then chosen by a CV grid search over this grid. By stochastically analyzing the behavior of the regularization parameter, this work shows that SVM performance can be modeled as a function of the separability and scatteredness (S&S) of the data. Separability is a measure of the distance between classes, and scatteredness is a ratio that captures the spread of the data points. In particular, for the hinge loss cost function, an S&S ratio-based table provides the optimum RP. The S&S ratio can also detect whether the data are linearly or non-linearly separable before the SVM algorithm is run, and the same table provides the optimum kernel and its parameters in advance. Consequently, the computational cost of the CV grid search is reduced to a single SVM run. Simulation results on a real dataset confirm the superiority and efficiency of the proposed approach, in terms of computational complexity, over the grid-search CV method.
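To make the computational claim concrete, the following minimal sketch counts how many SVM training runs an exhaustive CV grid search needs, compared with the single run the abstract describes. The grid values, kernel names, fold count, and the helper `count_grid_search_fits` are hypothetical and chosen only for illustration; they are not taken from the paper, and the S&S table itself is not reproduced here.

```python
from itertools import product


def count_grid_search_fits(c_values, kernel_grid, n_folds):
    """Count SVM training runs needed by an exhaustive grid-search CV.

    kernel_grid maps a kernel name to a list of parameter settings
    (one dict per setting). Every (RP, kernel setting) pair is trained
    once per fold; a final refit on the full data is not counted.
    """
    kernel_settings = [
        (name, params)
        for name, settings in kernel_grid.items()
        for params in settings
    ]
    candidates = list(product(c_values, kernel_settings))
    return len(candidates) * n_folds


# Hypothetical search space (assumption, for illustration only).
C_VALUES = [0.01, 0.1, 1, 10, 100]          # candidate RP values
KERNEL_GRID = {
    "linear": [{}],
    "rbf": [{"gamma": g} for g in (0.001, 0.01, 0.1, 1)],
    "poly": [{"degree": d} for d in (2, 3)],
}
N_FOLDS = 5

grid_search_fits = count_grid_search_fits(C_VALUES, KERNEL_GRID, N_FOLDS)
table_lookup_fits = 1  # the abstract's claim: one SVM run after the table lookup

print(grid_search_fits, table_lookup_fits)  # → 175 1
```

With 5 RP values, 7 kernel settings, and 5 folds, the grid search trains 5 × 7 × 5 = 175 models, and the count grows multiplicatively with each added RP value, kernel parameter, or fold.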