Although advances in machine learning have driven the development of malicious URL detection, current techniques still struggle to generalize and to remain robust against evolving threats. In this paper, we propose PyraTrans, a novel method that integrates pretrained Transformers with pyramid feature learning to detect malicious URLs. PyraTrans uses a pretrained CharBERT as its foundation and augments it with three interconnected feature modules: 1) Encoder Feature Extraction, which extracts multi-order feature matrices from each CharBERT encoder layer; 2) Multi-Scale Feature Learning, which captures local contextual information at several scales and aggregates information across encoder layers; and 3) Spatial Pyramid Attention, which applies region-level attention to emphasize areas rich in expressive information. The proposed approach addresses the Transformer's limitations in local feature learning and regional relational awareness, both of which are vital for capturing URL-specific word patterns, character combinations, and structural anomalies. Across several challenging experimental scenarios, the proposed method shows significant improvements in accuracy, generalization, and robustness in malicious URL detection. For instance, it achieves a peak F1-score improvement of 40% in class-imbalanced scenarios and exceeds the best baseline by 14.13% in accuracy in adversarial attack scenarios. Additionally, we conduct a case study in which our method accurately identifies all 30 active malicious web pages, whereas two prior SOTA methods miss 4 and 7 malicious web pages, respectively. Code and data are available at: https://github.com/Alixyvtte/PyraTrans.
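To make the three-module description above more concrete, the following is a minimal, dependency-free sketch of how such a data flow might look. It is not the authors' implementation: the function names, window widths, pyramid levels, and the use of plain averaging in place of learned convolutions and attention parameters are all assumptions made for illustration, and the per-layer matrices are hypothetical stand-ins for CharBERT encoder outputs.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # shape: (sequence_length, hidden_size)


def local_average(features: Matrix, width: int) -> Matrix:
    """Average each position with its neighbours (a stand-in for a learned 1-D convolution)."""
    n, half = len(features), width // 2
    out = []
    for i in range(n):
        window = features[max(0, i - half): min(n, i + half + 1)]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def multi_scale(features: Matrix, widths: Sequence[int] = (1, 3, 5)) -> Matrix:
    """Concatenate local averages computed at several window widths (assumed scales)."""
    scaled = [local_average(features, w) for w in widths]
    return [sum((s[i] for s in scaled), []) for i in range(len(features))]


def aggregate_layers(per_layer: List[Matrix]) -> Matrix:
    """Element-wise mean across encoder layers (one simple aggregation choice)."""
    count = len(per_layer)
    return [[sum(vals) / count for vals in zip(*rows)] for rows in zip(*per_layer)]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pyramid_weights(features: Matrix, levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Per-position weight: at each pyramid level, split the sequence into regions,
    score each region by its mean absolute activation, apply a softmax over the
    regions, and average the resulting region weights across levels."""
    n = len(features)
    weights = [0.0] * n
    for level in levels:
        regions = min(level, n)  # never create empty regions
        bounds = [k * n // regions for k in range(regions + 1)]
        scores = []
        for k in range(regions):
            region = features[bounds[k]:bounds[k + 1]]
            total = sum(abs(v) for row in region for v in row)
            scores.append(total / (len(region) * len(region[0])))
        attention = softmax(scores)
        for k in range(regions):
            for i in range(bounds[k], bounds[k + 1]):
                weights[i] += attention[k] / len(levels)
    return weights


def pyramid_features(layer_outputs: List[Matrix]) -> Matrix:
    """Chain the three steps: per-layer features, multi-scale fusion, region attention."""
    fused = aggregate_layers([multi_scale(layer) for layer in layer_outputs])
    weights = pyramid_weights(fused)
    return [[v * w for v in row] for row, w in zip(fused, weights)]


if __name__ == "__main__":
    # Two hypothetical encoder layers, 8 characters, hidden size 2.
    layer = [[0.1, 0.1]] * 6 + [[5.0, 5.0]] * 2
    attended = pyramid_features([layer, layer])
    print(len(attended), len(attended[0]))
```

In this toy version, positions that fall in strongly activated regions receive larger weights, which mirrors the stated goal of emphasizing information-rich areas of a URL. A real model would replace the averaging with trained layers and feed the result to a classifier.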