Benchmarking Large Language Models for Zero-shot and Few-shot Phishing URL Detection

The Uniform Resource Locator (URL) was introduced in a connectivity-first era to define how resources are accessed and located. Its design remains limited by that history: it lacks built-in mechanisms for security, trust, or resilience against fraud and abuse, and later protections such as HTTPS were added reactively. In the current AI-first threatscape, deceptive URLs have reached unprecedented sophistication. Cybercriminals make widespread use of generative AI, and an AI-vs-AI arms race now produces context-aware phishing websites and URLs that are virtually indistinguishable from legitimate ones to both users and traditional detection tools. Although AI-generated phishing accounted for only a small fraction of filter-bypassing attacks in 2024, phishing volume has escalated by over 4,000% since 2022, with nearly 50% more attacks evading detection. Because the threatscape is escalating and phishing tactics emerge faster than labeled data can be produced, zero-shot and few-shot learning with large language models (LLMs) offers a timely and adaptable solution that generalizes with minimal supervision. Given the critical importance of phishing URL detection in large-scale cybersecurity defense systems, we present a comprehensive benchmark of LLMs under a unified zero-shot and few-shot prompting framework and reveal the operational trade-offs. Our evaluation uses a balanced dataset and consistent prompts, and analyzes performance, generalization, and model efficacy in detail using accuracy, precision, recall, F1 score, AUROC, and AUPRC, which together reflect both classification quality and practical utility in threat detection settings. We conclude that few-shot prompting improves performance across multiple LLMs.
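To make the evaluation setup concrete, the following is a minimal sketch of how a unified zero-shot and few-shot prompt might be built and how some of the reported metrics can be computed. It is not the benchmark's actual code: the instruction wording, the label names `phishing` and `benign`, and the helper names `build_prompt`, `binary_metrics`, and `auroc` are all assumptions made for illustration, and the LLM call itself is left out.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical instruction text; the benchmark's real prompt is not given here.
INSTRUCTION = (
    "Classify the following web address as 'phishing' or 'benign'. "
    "Answer with a single word."
)


def build_prompt(url: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build one prompt: zero-shot if `examples` is empty, few-shot otherwise."""
    lines = [INSTRUCTION, ""]
    for ex_url, ex_label in examples:
        # Each labeled demonstration uses the same layout as the final query.
        lines += [f"URL: {ex_url}", f"Label: {ex_label}", ""]
    lines += [f"URL: {url}", "Label:"]
    return "\n".join(lines)


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1, with 1 = phishing and 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the share of (phishing, benign) pairs ranked correctly.

    Ties count as half. This is quadratic in the number of samples, which is
    fine for a sketch but not for large evaluation sets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Using one builder for both settings keeps the instruction and layout identical, so any difference in scores can be attributed to the demonstrations rather than to prompt wording. AUROC and AUPRC require a continuous score per URL (for example, a model-reported probability of the phishing label), whereas the other four metrics need only the hard prediction.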
