Phishing attacks have inflicted substantial losses on individuals and businesses alike, necessitating the development of robust and efficient automated phishing detection approaches. Reference-based phishing detectors (RBPDs), which compare the logos on a target webpage to a known set of logos, have emerged as the state-of-the-art approach. However, a major limitation of existing RBPDs is that they rely on a manually constructed brand knowledge base, which cannot feasibly scale to a large number of brands. The resulting gaps in brand coverage lead to false negatives. To address this issue, we propose an automated knowledge collection pipeline and use it to collect and release KnowPhish, a large-scale multimodal brand knowledge base containing 20k brands with rich information about each brand. KnowPhish can be used to boost the performance of existing RBPDs in a plug-and-play manner. A second limitation of existing RBPDs is that they rely solely on the image modality, ignoring useful textual information present in the webpage HTML. To utilize this textual information, we propose a Large Language Model (LLM)-based approach that extracts a webpage's brand information from its text. Our resulting multimodal phishing detection approach, KnowPhish Detector (KPD), can detect phishing webpages with or without logos. We evaluate KnowPhish and KPD on a manually validated dataset and in a field study under Singapore's local context, showing substantial improvements in effectiveness and efficiency compared to state-of-the-art baselines.
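To make the reference-based idea concrete, the following is a minimal sketch and not the KPD implementation. It assumes a common decision rule for this family of detectors: identify the brand a page presents, then flag the page if its hostname is not one of that brand's known legitimate domains. The `Brand` record, the toy alias lookup standing in for logo matching or LLM-based text extraction, and the example domains are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse


@dataclass
class Brand:
    """Hypothetical knowledge-base entry: a brand and its legitimate domains."""
    name: str
    domains: set[str]
    aliases: set[str] = field(default_factory=set)


def match_brand_from_text(page_text: str, knowledge_base: list[Brand]) -> Optional[Brand]:
    """Toy stand-in for brand extraction: case-insensitive name/alias lookup.

    A real detector would use logo matching or an LLM here (assumption).
    """
    text = page_text.lower()
    for brand in knowledge_base:
        names = {brand.name, *brand.aliases}
        if any(name.lower() in text for name in names):
            return brand
    return None


def is_phishing(url: str, page_text: str, knowledge_base: list[Brand]) -> bool:
    """Flag a page that presents a known brand on a domain the brand does not own."""
    host = (urlparse(url).hostname or "").lower()
    brand = match_brand_from_text(page_text, knowledge_base)
    if brand is None:
        # Brand not covered by the knowledge base: the page cannot be flagged.
        # This is the false-negative case that limited brand coverage causes.
        return False
    on_legit_domain = any(host == d or host.endswith("." + d) for d in brand.domains)
    return not on_legit_domain
```

Under this rule, a page whose brand is missing from the knowledge base is never flagged, which illustrates why broader brand coverage directly reduces false negatives.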