Bayesian classifiers perform well when the features are completely independent of one another, an assumption that does not always hold in real-world applications. The aim of this study is to implement and compare the performance of three variants of the Bayesian classifier (Multinomial, Bernoulli, and Gaussian) on anomaly detection in network intrusion, and to investigate whether there is any association between each variant's assumption and its performance. Our investigation showed that each variant applies its distributional assumption regardless of the actual properties of the features, and that this assumption is the single most important factor influencing its accuracy. Experimental results show that Bernoulli achieves a test accuracy of 69.9% (71% train), Multinomial 31.2% (31.2% train), and Gaussian 81.69% (82.84% train). Looking deeper, we found that the accuracy of each Naive Bayes variant is largely explained by its assumption: the Gaussian classifier performed best on anomaly detection because it assumes that features follow normal distributions, which are continuous, while the Multinomial classifier performed poorly because it assumes discrete, multinomially distributed features.
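The link between a variant's distributional assumption and its accuracy can be illustrated with a minimal sketch. It is not the study's implementation and does not use its dataset: the synthetic data, the class means, the binarization threshold of 0, and the class names `GaussianNB` and `BernoulliNB` (written from scratch here) are all assumptions made for illustration. On continuous features whose classes differ only in their mean, a Gaussian model can separate the classes, while a Bernoulli model that first reduces each feature to 0 or 1 discards most of that information.

```python
import math
import random


def make_data(n_per_class, n_features, rng):
    """Hypothetical continuous data: class 0 ~ N(2, 1), class 1 ~ N(5, 1) per feature."""
    X, y = [], []
    for label, mean in ((0, 2.0), (1, 5.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mean, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y


class GaussianNB:
    """Naive Bayes assuming each feature is normally distributed within a class."""

    def fit(self, X, y):
        self.stats = {}
        for c in set(y):
            rows = [x for x, t in zip(X, y) if t == c]
            params = []
            for col in zip(*rows):
                mu = sum(col) / len(col)
                var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
                params.append((mu, var))
            self.stats[c] = (math.log(len(rows) / len(X)), params)
        return self

    def _score(self, x, c):
        # Log prior plus the sum of per-feature Gaussian log densities.
        log_prior, params = self.stats[c]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
            for v, (mu, var) in zip(x, params)
        )

    def predict(self, X):
        return [max(self.stats, key=lambda c: self._score(x, c)) for x in X]


class BernoulliNB:
    """Naive Bayes assuming binary features; inputs are binarized at `threshold`."""

    def __init__(self, threshold=0.0):  # threshold of 0 is an assumption
        self.threshold = threshold

    def _binarize(self, x):
        return [1 if v > self.threshold else 0 for v in x]

    def fit(self, X, y):
        self.stats = {}
        for c in set(y):
            rows = [self._binarize(x) for x, t in zip(X, y) if t == c]
            # Laplace-smoothed estimate of P(feature = 1 | class).
            probs = [(sum(col) + 1) / (len(col) + 2) for col in zip(*rows)]
            self.stats[c] = (math.log(len(rows) / len(X)), probs)
        return self

    def _score(self, x, c):
        log_prior, probs = self.stats[c]
        return log_prior + sum(
            math.log(p) if b else math.log(1 - p)
            for b, p in zip(self._binarize(x), probs)
        )

    def predict(self, X):
        return [max(self.stats, key=lambda c: self._score(x, c)) for x in X]


def accuracy(model, X, y):
    preds = model.predict(X)
    return sum(p == t for p, t in zip(preds, y)) / len(y)


rng = random.Random(0)
X_train, y_train = make_data(200, 3, rng)
X_test, y_test = make_data(200, 3, rng)

acc_gauss = accuracy(GaussianNB().fit(X_train, y_train), X_test, y_test)
acc_bern = accuracy(BernoulliNB().fit(X_train, y_train), X_test, y_test)
print(f"Gaussian NB test accuracy:  {acc_gauss:.3f}")
print(f"Bernoulli NB test accuracy: {acc_bern:.3f}")
```

Because nearly every value in both classes lies above the threshold, the binarized features look almost identical across classes, so the Bernoulli model has little to learn from. The Gaussian model works directly on the continuous values and can use the difference in means. This is a toy setting chosen to make the effect visible, and the accuracies it prints are unrelated to the figures reported above.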