The proliferation of malicious URLs has made their detection crucial for enhancing network security. While pre-trained language models offer promise, existing methods struggle with domain-specific adaptability, character-level information, and the integration of local and global encodings. To address these challenges, we propose PMANet, a pre-trained language-model-guided multi-level feature attention network. PMANet employs a post-training process with three self-supervised objectives: masked language modeling, noisy language modeling, and domain discrimination, effectively capturing subword- and character-level information. It also includes a hierarchical representation module and a dynamic layer-wise attention mechanism for extracting features from low to high levels. Additionally, spatial pyramid pooling integrates local and global features. Experiments across diverse scenarios, including small-scale data, class imbalance, and adversarial attacks, demonstrate PMANet's superiority over state-of-the-art models, achieving an AUC of 0.9941 and correctly detecting all 20 malicious URLs in a case study. Code and data are available at https://github.com/Alixyvtte/Malicious-URL-Detection-PMANet.
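The dynamic layer-wise attention idea can be illustrated with a minimal sketch: softmax-normalize one learnable scalar score per encoder layer, then take the weighted sum of the per-layer representations. The function name, the use of a plain softmax, and the assumption that scores are supplied externally are illustrative choices, not the paper's exact parameterization.

```python
import math


def layerwise_attention(layer_outputs, scores):
    """Fuse per-layer representations with softmax-weighted attention.

    layer_outputs: list of L vectors (lists of floats), one per encoder
    layer. scores: L scalars (learnable in a real model; given here).
    Returns (weights, fused), where fused is the weighted sum of layers.
    Hypothetical sketch of dynamic layer-wise attention.
    """
    # Softmax over the per-layer scores.
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]

    # Weighted sum across layers, dimension by dimension.
    dim = len(layer_outputs[0])
    fused = [sum(w * layer[i] for w, layer in zip(weights, layer_outputs))
             for i in range(dim)]
    return weights, fused
```

With equal scores the weights are uniform and the fused vector is simply the mean of the layer outputs; training would shift the scores toward the layers most useful for malicious-URL classification.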