The escalating sophistication of phishing emails necessitates a shift beyond traditional rule-based and conventional machine-learning-based detectors. Although large language models (LLMs) offer strong natural language understanding, using them as standalone classifiers often yields elevated false-positive (FP) rates: legitimate emails are mislabeled as phishing, which creates a significant operational burden. This paper presents a personalized phishing detection framework that integrates LLMs with retrieval-augmented generation (RAG). For each message, the system constructs user-specific context by retrieving a compact set of the user's historical legitimate emails and enriching it with real-time domain and URL reputation from a cyber-threat intelligence platform; it then conditions the LLM's decision on this evidence. We evaluate four open-source LLMs (Llama4-Scout, DeepSeek-R1, Mistral-Saba, and Gemma2) on an email dataset collected from public and institutional sources. For example, Llama4-Scout attains an F1-score of 0.9703 and a 66.7% reduction in FPs with RAG. These findings indicate that a RAG-based, user-profiling approach is both feasible and effective for building high-precision, low-friction email security systems that adapt to individual communication patterns.
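The context-construction step described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the bag-of-words cosine retriever is a stand-in for whatever retriever the framework uses, the `reputation` dictionary is a hypothetical placeholder for a lookup against a cyber-threat intelligence platform, and the prompt wording and all function names are assumptions made for illustration. No LLM is called; the sketch only assembles the evidence the model would be conditioned on.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_history(message, history, k=2):
    """Return the k historical legitimate emails most similar to the message.

    Assumption: plain bag-of-words similarity stands in for the retriever.
    """
    query = Counter(tokenize(message))
    ranked = sorted(
        history,
        key=lambda past: cosine(query, Counter(tokenize(past))),
        reverse=True,
    )
    return ranked[:k]


def extract_domains(message):
    """Collect the distinct hostnames of http(s) URLs found in the message."""
    hosts = set()
    for url in re.findall(r"https?://\S+", message):
        host = urlparse(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


def build_prompt(message, history, reputation, k=2):
    """Assemble user-specific context and reputation evidence into one prompt.

    `reputation` is a hypothetical dict mapping domain -> verdict string; a
    real system would query a threat-intelligence service here instead.
    """
    lines = ["Known legitimate emails for this user:"]
    lines += [f"- {past}" for past in retrieve_history(message, history, k)]
    lines.append("Domain reputation:")
    for domain in extract_domains(message):
        lines.append(f"- {domain}: {reputation.get(domain, 'unknown')}")
    lines.append("Email to classify:")
    lines.append(message)
    lines.append("Answer with 'phishing' or 'legitimate'.")
    return "\n".join(lines)
```

Keeping retrieval, reputation lookup, and prompt assembly as separate functions means each stage can be swapped independently, for instance replacing the toy retriever with an embedding index, without touching the rest of the pipeline.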