Machine Learning-Based Intrusion Detection: Feature Selection versus Feature Extraction

The Internet of Things (IoT) plays an important role in many sectors, such as smart cities, smart agriculture, smart healthcare, and smart manufacturing. However, IoT devices are highly vulnerable to cyber-attacks, which may result in security breaches and data leakage. To counter these attacks, a variety of machine learning-based network intrusion detection methods for IoT networks have been developed. These methods often rely on either feature extraction or feature selection to reduce the dimension of the input data before it is fed into machine learning models. The goal is to make detection complexity low enough for real-time operation, which is vital in any intrusion detection system. This paper provides a comprehensive comparison of these two feature reduction methods for intrusion detection in terms of several performance metrics, namely precision, recall, detection accuracy, and runtime complexity, using the modern UNSW-NB15 dataset for both binary and multiclass classification. In general, feature selection not only provides better detection performance but also requires less training and inference time than feature extraction, especially as the number of reduced features K increases. However, feature extraction is much more reliable than feature selection, particularly when K is very small, such as K = 4. Additionally, feature extraction is less sensitive to changes in K than feature selection, and this holds for both binary and multiclass classification. Based on this comparison, we provide a guideline for selecting a suitable intrusion detection type for each specific scenario, as detailed in Tab. 14 at the end of Section IV.
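
To make the distinction between the two feature reduction approaches concrete, the following is a minimal sketch on synthetic data. It is not the paper's pipeline: the abstract does not name the specific algorithms used, so correlation ranking stands in for feature selection and an approximate PCA (power iteration with re-orthogonalization) stands in for feature extraction. The toy data, the function names `select_top_k` and `pca_extract`, and the choice K = 2 are all assumptions made for illustration; UNSW-NB15 is not used here.

```python
import random
from statistics import mean


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def select_top_k(X, y, k):
    """Feature selection (stand-in): keep the k ORIGINAL columns whose
    absolute correlation with the label is highest."""
    d = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(d)]
    kept = sorted(sorted(range(d), key=lambda j: scores[j], reverse=True)[:k])
    return kept, [[row[j] for j in kept] for row in X]


def pca_extract(X, k, iters=100, seed=0):
    """Feature extraction (stand-in): project onto k NEW axes, each a linear
    combination of all original columns. Approximate PCA by power iteration."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    means = [mean(row[j] for row in X) for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    cov = [[sum(r[i] * r[j] for r in Xc) / n for j in range(d)] for i in range(d)]
    components = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        norm = dot(v, v) ** 0.5
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [dot(cov[i], v) for i in range(d)]
            # Remove directions already found so the next axis is orthogonal.
            for c in components:
                proj = dot(w, c)
                w = [wi - proj * ci for wi, ci in zip(w, c)]
            norm = dot(w, w) ** 0.5
            if norm == 0:
                break
            v = [x / norm for x in w]
        components.append(v)
    return components, [[dot(r, c) for c in components] for r in Xc]


# Toy data (assumption): two latent factors drive four columns; two are noise.
rng = random.Random(42)
X, y = [], []
for _ in range(300):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    e = [rng.gauss(0, 0.3) for _ in range(6)]
    X.append([a + e[0], a + e[1], b + e[2], b + e[3], e[4], e[5]])
    y.append(1 if a + b > 0 else 0)

K = 2
kept, X_sel = select_top_k(X, y, K)       # K of the original columns
components, X_ext = pca_extract(X, K)     # K newly constructed columns
print("selected original columns:", kept)
print("reduced shapes:", (len(X_sel), len(X_sel[0])), (len(X_ext), len(X_ext[0])))
```

Both calls return a matrix with K columns, but selection discards the other original columns outright, while extraction must read every original column to compute each new one. That difference is one plausible reason extraction could cost more at inference time yet degrade more gracefully when K is very small; the paper's own measurements, not this sketch, are the evidence for those claims.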