IP-Augmented Multi-Modal Malicious URL Detection via Token-Contrastive Representation Enhancement and Multi-Granularity Fusion

Malicious URL detection remains a critical cybersecurity challenge as adversaries increasingly employ sophisticated evasion techniques, including obfuscation, character-level perturbations, and adversarial attacks. Although pre-trained language models (PLMs) such as BERT have shown potential for URL analysis tasks, three limitations persist in current implementations: (1) an inability to effectively model the non-natural hierarchical structure of URLs, (2) insufficient sensitivity to character-level obfuscation, and (3) a lack of mechanisms for incorporating auxiliary network-level signals such as IP addresses. Each of these capabilities is essential for robust detection. To address these challenges, we propose CURL-IP, a multi-modal detection framework built on three key innovations: (1) a Token-Contrastive Representation Enhancer, which refines subword token representations through token-aware contrastive learning to produce more discriminative and isotropic embeddings; (2) a Cross-Layer Multi-Scale Aggregator, which hierarchically aggregates Transformer outputs via convolutional operations and gated MLPs to capture both local and global semantic patterns across layers; and (3) a Blockwise Multi-Modal Coupler, which decomposes URL-IP features into localized block units and computes cross-modal attention weights at the block level, enabling fine-grained inter-modal interaction. This architecture preserves fine-grained lexical cues and contextual semantics while integrating network-level signals. Our evaluation on large-scale real-world datasets shows that the framework significantly outperforms state-of-the-art baselines on both binary and multi-class classification tasks.
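
To make the block-level coupling idea concrete, the following is a minimal sketch in plain Python of how URL and IP feature vectors might be split into blocks and fused with one attention weight per block. The abstract does not specify the scoring or fusion rule, so everything here is an assumption: the function names `split_blocks` and `blockwise_couple`, the `num_blocks` parameter, the scaled dot-product score between matching blocks, and the weighted element-wise sum are all hypothetical and are not taken from CURL-IP itself.

```python
import math


def split_blocks(vec, num_blocks):
    """Split a flat feature vector into equal-sized contiguous blocks."""
    if len(vec) % num_blocks != 0:
        raise ValueError("vector length must be divisible by num_blocks")
    size = len(vec) // num_blocks
    return [vec[i * size:(i + 1) * size] for i in range(num_blocks)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blockwise_couple(url_feat, ip_feat, num_blocks):
    """Fuse URL and IP features with one attention weight per block.

    Assumption (not from the paper): each block score is the scaled dot
    product of the matching URL and IP blocks, and each fused block is
    the weighted element-wise sum of the two.
    """
    if len(url_feat) != len(ip_feat):
        raise ValueError("url_feat and ip_feat must have the same length")
    url_blocks = split_blocks(url_feat, num_blocks)
    ip_blocks = split_blocks(ip_feat, num_blocks)
    scale = math.sqrt(len(url_blocks[0]))

    # One cross-modal agreement score per block.
    scores = [
        sum(u * v for u, v in zip(ub, vb)) / scale
        for ub, vb in zip(url_blocks, ip_blocks)
    ]
    # Normalize the scores across blocks to obtain attention weights.
    weights = softmax(scores)

    # Scale each summed block by its weight and concatenate the results.
    fused = []
    for w, ub, vb in zip(weights, url_blocks, ip_blocks):
        fused.extend(w * (u + v) for u, v in zip(ub, vb))
    return weights, fused


# Toy example: 4-dimensional features split into 2 blocks of size 2.
url_feat = [1.0, 0.0, 0.0, 1.0]
ip_feat = [1.0, 0.0, 0.0, -1.0]
weights, fused = blockwise_couple(url_feat, ip_feat, num_blocks=2)
```

In this toy input the two modalities agree in the first block and disagree in the second, so the first block receives the larger weight. A learned model would presumably replace the fixed dot-product score with trainable projections, but that detail is outside what the abstract states.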
