Support vector machines (SVMs) are among the most widely used and best studied machine learning models for two-class classification. Classification in an SVM is based on a score, which yields a deterministic classification rule; this rule can be transformed into a probabilistic one (as implemented in off-the-shelf SVM libraries), but it is not probabilistic in nature. Moreover, tuning the regularization parameters of an SVM is known to be computationally expensive, and it generates information that is not fully exploited, since it is not used to build a probabilistic classification rule. In this paper we propose a novel approach to generating probabilistic outputs for the SVM. The new method has three properties. First, it is designed to be cost-sensitive, so the different importance of sensitivity (true positive rate, TPR) and specificity (true negative rate, TNR) is readily accommodated in the model. As a result, the model can deal with imbalanced datasets, which are common in operational business problems such as churn prediction or credit scoring. Second, the SVM is embedded in an ensemble method to improve its performance, making use of the valuable information generated in the parameter tuning process. Finally, the probabilities are estimated via bootstrap estimates, avoiding the parametric models used by competing approaches. Numerical tests on a wide range of datasets show the advantages of our approach over benchmark procedures.
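The three properties listed in the abstract can be illustrated with a small sketch. It is not the algorithm of the paper, whose details the abstract does not give. It assumes a linear SVM trained by stochastic subgradient descent on a class-weighted hinge loss, one model per bootstrap resample and per candidate value of the regularization parameter, and a probability defined as the share of models voting for the positive class. The names `train_weighted_linear_svm`, `bootstrap_probabilities`, `C_grid`, and `pos_weight` are hypothetical and chosen only for this illustration.

```python
import random

def train_weighted_linear_svm(X, y, C, pos_weight, epochs=20, lr=0.1, rng=None):
    """SGD on a class-weighted hinge loss; labels are +1 / -1.

    Assumption: a linear kernel and a plain SGD solver stand in for
    whatever SVM solver is actually used.
    """
    rng = rng or random.Random(0)
    n, d = len(X), len(X[0])
    lam = 1.0 / (C * n)                    # regularization strength implied by C
    shrink = max(0.0, 1.0 - lr * lam)      # clamp so the step never flips sign
    w, b = [0.0] * d, 0.0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # Cost-sensitivity: errors on positives cost pos_weight, others 1.
            cost = pos_weight if y[i] == 1 else 1.0
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [shrink * wj for wj in w]
            if margin < 1.0:               # hinge loss is active for this point
                w = [wj + lr * cost * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * cost * y[i]
    return w, b

def bootstrap_probabilities(X, y, X_new, C_grid, pos_weight=1.0, n_boot=20, seed=0):
    """Fraction of (bootstrap resample, C) models that vote positive."""
    rng = random.Random(seed)
    n = len(X)
    votes = [0] * len(X_new)
    n_models = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        for C in C_grid:                   # keep every tuned model, not only the best
            w, b = train_weighted_linear_svm(Xb, yb, C, pos_weight, rng=rng)
            n_models += 1
            for k, x in enumerate(X_new):
                score = sum(wj * xj for wj, xj in zip(w, x)) + b
                if score > 0:
                    votes[k] += 1
    return [v / n_models for v in votes]

# Toy data (made up for the illustration): two well-separated Gaussian blobs.
data_rng = random.Random(1)
X = [[data_rng.gauss(2, 0.5), data_rng.gauss(2, 0.5)] for _ in range(20)]
X += [[data_rng.gauss(-2, 0.5), data_rng.gauss(-2, 0.5)] for _ in range(20)]
y = [1] * 20 + [-1] * 20
probs = bootstrap_probabilities(X, y, [[3.0, 3.0], [-3.0, -3.0]],
                                C_grid=[0.1, 1.0, 10.0], pos_weight=2.0)
print(probs)
```

Raising `pos_weight` above 1 penalizes missed positives more heavily, which is one simple way to trade specificity for sensitivity on imbalanced data. Keeping every model from the grid over `C`, rather than only the best one, mirrors the idea of reusing the information produced during tuning, and the vote share is a nonparametric estimate that involves no fitted sigmoid.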