URLs play a crucial role in understanding and categorizing web content, particularly in tasks related to security control and online recommendation. Although pre-trained models now dominate many fields, URL analysis still lacks specialized pre-trained models. To address this gap, this paper introduces URLBERT, the first pre-trained representation learning model applied to a variety of URL classification and detection tasks. We first train a URL tokenizer on a corpus of billions of URLs to handle URL tokenization. We then propose two novel pre-training tasks: (1) self-supervised contrastive learning, which strengthens the model's understanding of URL structure and its ability to capture category differences by distinguishing different variants of the same URL; and (2) virtual adversarial training, which aims to improve the model's robustness in extracting semantic features from URLs. We evaluate the proposed methods on phishing URL detection, web page classification, and ad filtering, achieving state-of-the-art performance. We also explore multi-task learning with URLBERT; experimental results show that a multi-task learning model based on URLBERT performs on par with independently fine-tuned models, demonstrating the simplicity with which URLBERT handles complex task requirements. The code for our work is available at https://github.com/Davidup1/URLBERT.
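To make the contrastive pre-training idea more concrete, the following is a minimal sketch of a contrastive objective over two variants of each URL. It is not the URLBERT implementation: the hashed character n-gram `embed` function stands in for the real encoder, `make_variant` uses random character dropout as one hypothetical way to produce variants, and the InfoNCE-style loss with in-batch negatives is an assumed choice of objective. All function names, parameters, and example URLs are illustrative.

```python
import math
import random
import zlib


def embed(url, dim=64, n=3):
    """Toy stand-in for an encoder: hashed character n-gram counts, L2-normalised."""
    vec = [0.0] * dim
    for i in range(max(len(url) - n + 1, 1)):
        # crc32 gives a hash that is stable across runs (unlike built-in hash()).
        vec[zlib.crc32(url[i:i + n].encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_variant(url, rng, drop_prob=0.1):
    """Hypothetical augmentation: drop each character with probability drop_prob."""
    kept = [ch for ch in url if rng.random() >= drop_prob]
    return "".join(kept) or url


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: positives[i] is the match for anchors[i];
    every other entry of positives acts as an in-batch negative."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


rng = random.Random(0)
urls = [  # made-up example URLs
    "https://example.com/login?user=alice",
    "http://shop.example.org/cart/checkout",
    "https://news.example.net/2024/01/article.html",
]
anchors = [embed(make_variant(u, rng)) for u in urls]
positives = [embed(make_variant(u, rng)) for u in urls]
print("contrastive loss:", info_nce(anchors, positives))
```

In an actual pre-training setup, the encoder would be a trainable network and this loss would be minimised by gradient descent, pulling variants of the same URL together and pushing different URLs apart; the sketch only evaluates the loss for fixed embeddings.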