The Naïve Bayes classifier has proven to be a tractable and efficient method for classification in multivariate analysis. However, features are usually correlated, which violates Naïve Bayes' assumption of conditional independence and may degrade the method's performance. Moreover, datasets often contain a large number of features, which may complicate the interpretation of the results and slow down the method's execution. In this paper we propose a sparse version of the Naïve Bayes classifier that is characterized by three properties. First, sparsity is achieved by taking into account the correlation structure of the covariates. Second, different performance measures can be used to guide the selection of features. Third, performance constraints on groups of higher interest can be included. Our proposal leads to a smart search that yields competitive running times while retaining flexibility in the choice of performance measure for classification. Our findings show that, when compared against well-referenced feature selection approaches, the proposed sparse Naïve Bayes obtains competitive results regarding accuracy, sparsity and running times for balanced datasets. For datasets with unbalanced classes (or classes of different importance), a better compromise between the classification rates of the different classes is achieved.
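To illustrate how a performance measure and a class-specific constraint can jointly guide feature selection, the following is a minimal sketch and not the procedure proposed in the paper. It assumes a Gaussian Naïve Bayes model, plain greedy forward selection, validation accuracy as the performance measure, and a minimum recall on one class of interest as the constraint. It does not model the correlation structure of the covariates. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random


def fit_gaussian_nb(X, y, features):
    """Estimate class log-priors and per-feature (mean, variance) pairs."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = []
        for j in features:
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            # Small constant avoids a zero variance.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mean, var))
        model[c] = (math.log(len(rows) / len(X)), stats)
    return model


def predict(model, features, x):
    """Return the class with the largest Naive Bayes log-posterior."""
    def score(c):
        log_prior, stats = model[c]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x[j] - mean) ** 2 / (2 * var)
            for j, (mean, var) in zip(features, stats)
        )
    return max(model, key=score)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, c):
    total = sum(1 for t in y_true if t == c)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
    return hits / total if total else 0.0


def greedy_select(X_tr, y_tr, X_val, y_val, key_class, min_recall, max_features):
    """Greedy forward selection: add the feature that most improves validation
    accuracy, skipping candidates whose recall on key_class is below min_recall."""
    selected, best = [], -1.0
    while len(selected) < max_features:
        best_j = None
        for j in range(len(X_tr[0])):
            if j in selected:
                continue
            feats = selected + [j]
            model = fit_gaussian_nb(X_tr, y_tr, feats)
            preds = [predict(model, feats, x) for x in X_val]
            if recall(y_val, preds, key_class) < min_recall:
                continue  # performance constraint violated
            acc = accuracy(y_val, preds)
            if acc > best:
                best, best_j = acc, j
        if best_j is None:
            break  # no feasible candidate improves the measure
        selected.append(best_j)
    return selected, best


# Toy data (assumption): feature 0 separates the classes, features 1-2 are noise.
rng = random.Random(0)


def make_data(n):
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(4.0 * label, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)])
        y.append(label)
    return X, y


X_tr, y_tr = make_data(200)
X_val, y_val = make_data(200)
selected, best = greedy_select(X_tr, y_tr, X_val, y_val,
                               key_class=1, min_recall=0.8, max_features=2)
print(selected, best)
```

Swapping `accuracy` for another measure, or changing `key_class` and `min_recall`, changes which features are retained, which is the kind of flexibility the abstract describes at a high level.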