Phishing attacks remain a persistent threat to online security and demand robust detection methods. This study investigates machine learning for identifying phishing URLs, with emphasis on the role of feature selection and model interpretability in improving performance. Using Recursive Feature Elimination (RFE), the study identified features such as "length_url", "time_domain_activation", and "Page_rank" as strong indicators of phishing attempts. Several algorithms, including CatBoost, XGBoost, and the Explainable Boosting Machine (EBM), were evaluated for robustness and scalability. XGBoost proved highly efficient in runtime, which makes it well suited to large datasets. CatBoost, by contrast, maintained high accuracy even when the feature set was reduced. To make the models more transparent and trustworthy, Explainable AI techniques such as SHAP were used to explain feature importance. The findings indicate that effective feature selection and model interpretability can significantly strengthen phishing detection systems, supporting more efficient and adaptable defenses against evolving cyber threats.
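The feature selection step can be illustrated with a small sketch of Recursive Feature Elimination: fit a model, drop the feature with the smallest importance, and repeat until the desired number of features remains. This is not the study's pipeline. It assumes a plain logistic regression (standing in for the boosted models), uses the absolute weight as the importance score, and runs on synthetic data in which the three feature names from the abstract are given arbitrary, invented effects and `noise_a` and `noise_b` are hypothetical uninformative columns.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - label
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def rfe(X, y, names, n_keep):
    """Recursive feature elimination: refit, drop the weakest feature, repeat."""
    active = list(range(len(names)))  # column indices still in the model
    eliminated = []                   # feature names in the order they were dropped
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w, _ = fit_logistic(X_sub, y)
        # Importance = |weight|; comparable here because all columns share one scale.
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        eliminated.append(names[active.pop(weakest)])
    return [names[j] for j in active], eliminated


def make_synthetic(n=400, seed=0):
    """Synthetic data (assumption): invented effects for three columns, two noise columns."""
    rng = random.Random(seed)
    names = ["length_url", "time_domain_activation", "Page_rank", "noise_a", "noise_b"]
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0, 1) for _ in names]
        # Arbitrary illustrative coefficients, not estimates from the study.
        score = 2.0 * row[0] - 1.5 * row[1] - 1.5 * row[2]
        y.append(1 if rng.random() < sigmoid(score) else 0)
        X.append(row)
    return X, y, names


if __name__ == "__main__":
    X, y, names = make_synthetic()
    kept, eliminated = rfe(X, y, names, n_keep=3)
    print("kept:", kept)
    print("eliminated (in order):", eliminated)
```

With a tree-based model, the same loop would rank features by the model's own importance scores instead of weight magnitudes; the elimination logic stays the same.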
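The interpretability step rests on Shapley values, the attribution scheme that SHAP approximates efficiently. The sketch below computes exact Shapley values for a single prediction by enumerating feature subsets, replacing "absent" features with a baseline value. It does not use the `shap` library and is not the study's model: `toy_score` is a hypothetical three-feature scoring function with arbitrary coefficients, and the all-zero baseline is an assumption made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline) to each feature."""
    d = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest take the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(d)]
        return predict(z)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Shapley weight for coalitions of this size: |S|! (d - |S| - 1)! / d!
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for S in combinations(others, size):
                with_i = value(set(S) | {i})
                without_i = value(set(S))
                phi[i] += weight * (with_i - without_i)
    return phi


def toy_score(z):
    # Hypothetical model: two additive terms plus one interaction between
    # features 0 and 1. Coefficients are arbitrary.
    return 2.0 * z[0] - 1.0 * z[2] + z[0] * z[1]


if __name__ == "__main__":
    x = [1.0, 1.0, 1.0]
    baseline = [0.0, 0.0, 0.0]
    phi = shapley_values(toy_score, x, baseline)
    print([round(v, 6) for v in phi])  # [2.5, 0.5, -1.0]
    # Efficiency property: attributions sum to the change in prediction.
    print(round(sum(phi), 6), toy_score(x) - toy_score(baseline))
```

The interaction term is split equally between the two features involved, and the attributions sum to the difference between the prediction and the baseline prediction. Exact enumeration costs time exponential in the number of features, which is why practical tools rely on model-specific or sampling-based approximations.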