Detecting malicious URLs is a crucial aspect of web search and mining, with a significant impact on internet security. Although advances in machine learning have improved the effectiveness of detection methods, these methods still face significant challenges in their capacity to generalize and their resilience against evolving threats. In this paper, we propose PyraTrans, an approach that combines the strengths of pretrained Transformers and pyramid feature learning to improve malicious URL detection. We implement PyraTrans by using a pretrained CharBERT as the base and augmenting it with three connected feature modules: 1) the Encoder Feature Extraction module, which extracts representations from each encoder layer of CharBERT to obtain multi-order features; 2) the Multi-Scale Feature Learning module, which captures multi-scale local contextual insights and aggregates information across different layer levels; and 3) the Pyramid Spatial Attention module, which learns hierarchical and spatial feature attentions, highlighting critical classification signals while reducing noise. The proposed approach addresses the limitations of the Transformer in local feature learning and spatial awareness, enabling us to extract multi-order, multi-scale URL feature representations with enhanced attentional focus. PyraTrans is evaluated on four benchmark datasets, where it demonstrates significant improvements over prior baseline methods. Notably, on the imbalanced dataset, with just 10% of the data used for training, our method achieves a TPR that is 3.3-6.5 times and an F1-score that is 2.9-4.5 times that of the baseline. Our approach also demonstrates robustness against adversarial attacks. Code and data are available at https://github.com/Alixyvtte/PyraTrans.
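The abstract reports results on the imbalanced dataset in terms of TPR and F1-score. As a reminder of how these two metrics are usually computed from binary predictions, here is a minimal sketch. The function name `tpr_and_f1` and the toy labels are illustrative assumptions; they are not taken from the PyraTrans code or datasets.

```python
def tpr_and_f1(y_true, y_pred):
    """Return (TPR, F1) for binary labels, where 1 = malicious and 0 = benign.

    Hypothetical helper for illustration only; not part of PyraTrans.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malicious, caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malicious, missed

    # TPR (recall): fraction of malicious URLs that were detected.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: fraction of flagged URLs that were actually malicious.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1: harmonic mean of precision and TPR.
    f1 = 2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0
    return tpr, f1


# Toy imbalanced example (made-up labels): 2 malicious URLs out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tpr, f1 = tpr_and_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f} TPR={tpr:.2f} F1={f1:.2f}")
```

In this toy case accuracy is 0.80 even though only half of the malicious URLs are detected, which illustrates why TPR and F1 tend to be more informative than accuracy when malicious samples are rare.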