Machine learning (ML) models are increasingly deployed in cybersecurity applications such as phishing detection and network intrusion prevention. However, these models remain vulnerable to adversarial perturbations: small, deliberate input modifications that can degrade detection accuracy and compromise interpretability. This paper presents an empirical study of adversarial robustness and explainability drift across two cybersecurity domains: phishing URL classification and network intrusion detection. We evaluate the impact of L∞-bounded Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) perturbations on model accuracy and introduce a quantitative metric, the Robustness Index (RI), defined as the area under the accuracy-perturbation curve. Gradient-based feature sensitivity and SHAP-based attribution drift analyses reveal which input features are most susceptible to adversarial manipulation. Experiments on the Phishing Websites and UNSW-NB15 datasets show consistent robustness trends, with adversarial training improving RI by up to 9% while maintaining clean-data accuracy. These findings highlight the coupling between robustness and interpretability degradation and underscore the importance of quantitative evaluation in the design of trustworthy, AI-driven cybersecurity systems.
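As a minimal sketch of how such a Robustness Index could be computed, one natural formalisation of "area under the accuracy-perturbation curve" is RI = (1/ε_max) ∫₀^{ε_max} Acc(ε) dε, approximated with the trapezoidal rule over an accuracy sweep. The snippet below assumes a PyTorch classifier and a one-step L∞ FGSM attack; the names (fgsm_attack, robustness_index, eps_grid) are illustrative and not the paper's actual implementation.

```python
# Illustrative sketch: Robustness Index (RI) as the normalised area under
# the accuracy-vs-epsilon curve for L-infinity FGSM perturbations.
# Assumes a PyTorch classifier `model` that returns logits.
import numpy as np
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps):
    """One-step L-infinity FGSM: x' = x + eps * sign(grad_x loss)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    model.zero_grad()
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

def robustness_index(model, x, y, eps_grid):
    """RI = (1 / eps_max) * integral of accuracy over epsilon,
    approximated with the trapezoidal rule."""
    model.eval()
    accs = []
    for eps in eps_grid:
        # eps == 0 gives clean-data accuracy; larger eps gives adversarial accuracy.
        x_adv = x if eps == 0 else fgsm_attack(model, x, y, eps)
        with torch.no_grad():
            preds = model(x_adv).argmax(dim=1)
        accs.append((preds == y).float().mean().item())
    return np.trapz(accs, eps_grid) / eps_grid[-1]
```

Under this reading, a sweep such as eps_grid = np.linspace(0.0, 0.1, 11) yields an RI in [0, 1], where values near 1 indicate accuracy that barely degrades under perturbation; a PGD-based variant would differ only in the attack step.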