In an evolving cyber landscape, detecting malicious URLs calls for cooperation and knowledge sharing across domains. However, collaboration is often hindered by concerns over privacy and business sensitivities. Federated learning addresses these issues by enabling multi-client collaboration without direct data exchange. Unfortunately, when highly expressive Transformer models are used, clients may face intolerable computational burdens, and exchanging model weights can quickly deplete network bandwidth. In this paper, we propose Fed-urlBERT, a federated URL pre-trained model designed to address both privacy concerns and the need for cross-domain collaboration in cybersecurity. Fed-urlBERT leverages split learning to divide the pre-training model into a client part and a server part, so that the client part requires fewer computational resources and less bandwidth. Our approach achieves performance comparable to that of the centralized model under both independent and identically distributed (IID) and two non-IID data scenarios. Notably, our federated model shows about a 7% decrease in the false positive rate (FPR) compared to the centralized model. Additionally, we implement an adaptive local aggregation strategy that mitigates heterogeneity among clients, demonstrating promising performance improvements. Overall, our study validates the applicability of the proposed Transformer-based federated learning for URL threat analysis, establishing a foundation for real-world collaborative cybersecurity efforts. The source code is accessible at https://github.com/Davidup1/FedURLBERT.
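To make the split-learning idea concrete, the following is a minimal sketch under stated assumptions, not the Fed-urlBERT implementation. A tiny linear "client part" and a logistic "server part" stand in for the Transformer layers. Only activations and their gradients cross the client/server boundary, and client parts are merged by plain sample-weighted averaging rather than the adaptive local aggregation strategy mentioned in the abstract. All names (`client_forward`, `server_step`, `client_backward`, `fed_avg`, `train_client`), the toy data, and the choice to share one server head across clients are hypothetical.

```python
import math

def client_forward(W_c, x):
    """Client part: linear layer mapping input features to activations."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_c]

def server_step(v, h, y, lr):
    """Server part: logistic head on the activations received from a client.

    Updates the server weights v in place and returns (loss, dL/dh).
    """
    logit = sum(vi * hi for vi, hi in zip(v, h))
    p = 1.0 / (1.0 + math.exp(-logit))
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    err = p - y                       # dL/dlogit for sigmoid + cross-entropy
    grad_h = [err * vi for vi in v]   # computed before v changes; sent to the client
    for i, hi in enumerate(h):
        v[i] -= lr * err * hi
    return loss, grad_h

def client_backward(W_c, x, grad_h, lr):
    """Client part: finish backpropagation using the gradient from the server."""
    for i, g in enumerate(grad_h):
        for j, xj in enumerate(x):
            W_c[i][j] -= lr * g * xj

def fed_avg(client_weights, sizes):
    """Sample-weighted average of client-side weight matrices (plain FedAvg)."""
    total = sum(sizes)
    rows, cols = len(client_weights[0]), len(client_weights[0][0])
    return [[sum(W[i][j] * n for W, n in zip(client_weights, sizes)) / total
             for j in range(cols)] for i in range(rows)]

def train_client(W_c, v, data, lr=0.1, epochs=20):
    """Run split training for one client; returns the mean loss per epoch."""
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            h = client_forward(W_c, x)                # client -> server: activations
            loss, grad_h = server_step(v, h, y, lr)   # server -> client: dL/dh
            client_backward(W_c, x, grad_h, lr)
            total += loss
        history.append(total / len(data))
    return history

if __name__ == "__main__":
    # Toy setup (assumed): two clients with private data, one shared server head.
    server_v = [0.5, -0.5]
    global_W = [[0.1, 0.2], [-0.1, 0.3]]
    clients = [
        [([1.0, 0.0], 1), ([0.0, 1.0], 0)],
        [([1.0, 0.2], 1)],
    ]
    for rnd in range(3):
        local = []
        for data in clients:
            W = [row[:] for row in global_W]   # each client starts from the global copy
            train_client(W, server_v, data, epochs=5)
            local.append(W)
        global_W = fed_avg(local, [len(d) for d in clients])
    print(global_W)
```

In this sketch the raw inputs `x` never leave the client, and the client stores and updates only `W_c`, which illustrates why splitting the model can reduce client-side computation and the volume of weights exchanged.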