Phishing websites continue to pose a significant security challenge, making the development of robust detection mechanisms essential. Brand Domain Identification (BDI) is a crucial step in many phishing detection approaches. This study systematically evaluates the effectiveness of features used for BDI over the past decade, focusing on their weighted importance in phishing detection as of 2025. The primary objective is to determine, using features popular in phishing detection, whether the identified brand domain matches the claimed domain. To validate feature importance and evaluate performance, we conducted two experiments on a dataset of 4,667 legitimate sites and 4,561 phishing sites. In Experiment 1, we used the Weka tool and four of its attribute ranking evaluators to identify optimized, important feature sets from five candidate features: CN Information (CN), Logo Domain (LD), Form Action Domain (FAD), Most Common Link Domain (MCLD), and Cookie Domain. The results showed that none of the features was redundant, and Random Forest emerged as the best classifier, achieving an accuracy of 99.7% with an average response time of 0.08 seconds. In Experiment 2, we trained five machine learning models (Random Forest, Decision Tree, Support Vector Machine, Multilayer Perceptron, and XGBoost) to assess the performance of individual BDI features and their combinations. The highest accuracy, 99.8%, was achieved by Random Forest with two combinations of only three features each: (MCLD, LD, FAD) and (MCLD, CN, LD). This study underscores the importance of leveraging key domain features for efficient phishing detection and paves the way for the development of real-time, scalable detection systems.
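As an illustration of the matching step described above, the following minimal Python sketch compares the domain identified by each BDI feature against the domain claimed by the page URL, and enumerates the three-feature combinations that Experiment 2 would have to consider. The abstract does not specify how features are encoded, so the binary match encoding, the abbreviation `CD` for Cookie Domain, the function names, and the naive "last two labels" domain extraction are all assumptions made for this example; a real system would likely rely on the Public Suffix List to find the registered domain.

```python
from itertools import combinations
from urllib.parse import urlparse

# The five BDI features named in the abstract.
# "CD" is a hypothetical abbreviation for Cookie Domain (not given in the source).
FEATURES = ("CN", "LD", "FAD", "MCLD", "CD")


def claimed_domain(url):
    """Return a naive registered domain: the last two labels of the hostname.

    Assumption for illustration only; multi-part suffixes such as
    "co.uk" are not handled here.
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def match_vector(url, identified):
    """Encode each feature as 1 if its identified domain equals the claimed one.

    `identified` maps a feature name to the domain that feature points to,
    or None when the feature could not be extracted from the page.
    """
    claimed = claimed_domain(url)
    vector = {}
    for feature in FEATURES:
        value = identified.get(feature)
        is_match = bool(claimed) and value is not None and value.lower() == claimed
        vector[feature] = int(is_match)
    return vector


def feature_subsets(size):
    """List every combination of `size` features, in FEATURES order."""
    return list(combinations(FEATURES, size))


if __name__ == "__main__":
    page = "https://login.example.com/signin"
    found = {"CN": "example.com", "LD": "paypal.com", "MCLD": "Example.com"}
    print(match_vector(page, found))
    print(len(feature_subsets(3)), "three-feature combinations")
```

Each match vector could then serve as one row of input to a classifier such as Random Forest, with `feature_subsets` selecting which columns to keep in a given run.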