In this paper, we introduce PhishLang, an open-source, lightweight language model designed specifically for phishing website detection through contextual analysis of the website. Traditional heuristic and machine learning models rely on static features and struggle to adapt to new threats, while deep learning models are computationally intensive. In contrast, our model leverages MobileBERT, a fast and memory-efficient variant of the BERT architecture, to learn granular features characteristic of phishing attacks. PhishLang operates with minimal data preprocessing and offers performance comparable to leading deep learning anti-phishing tools, while being significantly faster and less resource-intensive. Over a 3.5-month testing period, PhishLang identified 25,796 phishing URLs, many of which were undetected by popular anti-phishing blocklists, demonstrating its potential to enhance current detection measures. Capitalizing on PhishLang's resource efficiency, we release the first open-source, fully client-side Chromium browser extension, which performs inference locally without consulting an online blocklist and can run on low-end systems with no impact on inference times. Our implementation not only outperforms prevalent (server-side) phishing tools, but is also significantly more effective than the limited commercial client-side measures available. Furthermore, we study how PhishLang can be integrated with GPT-3.5 Turbo to create explainable blocklisting: upon detecting a phishing website, it provides users with detailed contextual information about the features that led to the website being marked as phishing.