Phishing attacks have inflicted substantial losses on individuals and businesses alike, creating a need for robust and efficient automated phishing detection. Reference-based phishing detectors (RBPDs), which compare the logos on a target webpage against a known set of logos, have emerged as the state-of-the-art approach. However, a major limitation of existing RBPDs is their reliance on a manually constructed brand knowledge base, which does not scale to a large number of brands and therefore produces false negatives when a brand is missing from the knowledge base. To address this issue, we propose an automated knowledge collection pipeline and use it to build KnowPhish, a large-scale multimodal brand knowledge base containing 20k brands with rich information about each brand. KnowPhish can be used to boost the performance of existing RBPDs in a plug-and-play manner. A second limitation of existing RBPDs is that they rely solely on the image modality, ignoring useful textual information present in the webpage HTML. To utilize this textual information, we propose a Large Language Model (LLM)-based approach that extracts the brand information of a webpage from its text. Our resulting multimodal phishing detection approach, KnowPhish Detector (KPD), can detect phishing webpages with or without logos. We evaluate KnowPhish and KPD on a manually validated dataset and in a field study in Singapore's local context, showing substantial improvements in effectiveness and efficiency over state-of-the-art baselines.
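To make the role of the brand knowledge base concrete, the following is a minimal sketch of the final decision step of a reference-based detector. It is not the KnowPhish or KPD implementation. It assumes that a brand has already been identified for the page (from a logo match or from text) and that each knowledge base entry lists the brand's legitimate domains; the names `Brand`, `hostname_matches`, and `is_phishing`, as well as the example brand and domains, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Brand:
    """Hypothetical knowledge base entry: a brand and its legitimate domains."""
    name: str
    legitimate_domains: FrozenSet[str]


def hostname_matches(hostname: str, domain: str) -> bool:
    """True if hostname equals the domain or is a subdomain of it."""
    return hostname == domain or hostname.endswith("." + domain)


def is_phishing(url: str,
                detected_brand: Optional[str],
                knowledge_base: Dict[str, Brand]) -> bool:
    """Flag a page whose detected brand does not own the page's domain.

    Assumption: detected_brand comes from an upstream logo or text matcher
    that is outside the scope of this sketch.
    """
    if detected_brand is None:
        return False  # no brand identified, so nothing to compare against
    brand = knowledge_base.get(detected_brand)
    if brand is None:
        # Brand missing from the knowledge base: the page cannot be flagged.
        # This is the coverage-driven false negative described above.
        return False
    hostname = (urlparse(url).hostname or "").lower()
    return not any(hostname_matches(hostname, d)
                   for d in brand.legitimate_domains)


# Illustrative knowledge base with a single made-up brand.
kb = {"examplebank": Brand("examplebank", frozenset({"examplebank.com"}))}
print(is_phishing("https://login.examplebank.com/", "examplebank", kb))
print(is_phishing("https://examplebank.com.evil.test/", "examplebank", kb))
```

In this sketch, a page that imitates a brand absent from `kb` is never flagged, which is why broader brand coverage directly reduces false negatives.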