With the rise of sophisticated phishing attacks, there is a growing need for effective and economical detection solutions. This paper explores the use of large multimodal agents, specifically Gemini 1.5 Flash and GPT-4o mini, to analyze both URLs and webpage screenshots via APIs, thus avoiding the complexity of training and maintaining AI systems. Our findings indicate that combining these two data types substantially improves detection performance over using either type alone. However, API usage incurs a per-query cost that depends on the number of input and output tokens. To address this, we propose a two-tiered agentic approach: one agent first assesses the URL alone, and only if that assessment is inconclusive does a second agent evaluate both the URL and the screenshot. This method maintains robust detection performance while significantly reducing API costs by avoiding unnecessary multi-input queries. Cost analysis shows that, per $100, GPT-4o mini can process about 4.2 times as many websites with the agentic approach as with the multimodal approach (107,440 vs. 25,626), and Gemini 1.5 Flash about 2.6 times as many (2,232,142 vs. 862,068). These findings underscore the economic benefits of the agentic approach over the multimodal approach, offering a viable option for organizations that want to use advanced AI for phishing detection while controlling expenses.
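The two-tiered flow and its cost effect can be illustrated with a minimal sketch. It is not the paper's implementation: the agent callables, the `"inconclusive"` label, the `load_screenshot` helper, and every cost parameter are hypothetical placeholders, and no vendor API is called. Only the website counts in the final comments come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    verdict: str  # "phishing" or "legitimate" (assumed label set)
    tier: int     # 1 = URL-only agent decided, 2 = URL + screenshot agent decided


def classify_site(
    url: str,
    load_screenshot: Callable[[str], bytes],      # hypothetical screenshot fetcher
    url_agent: Callable[[str], str],              # hypothetical tier-1 agent
    multimodal_agent: Callable[[str, bytes], str],  # hypothetical tier-2 agent
) -> CascadeResult:
    """Run the cheap URL-only agent first; escalate only when it is inconclusive."""
    first = url_agent(url)
    if first != "inconclusive":
        return CascadeResult(first, 1)
    # The screenshot is fetched and sent only for escalated sites,
    # which is where the cost saving comes from.
    screenshot = load_screenshot(url)
    return CascadeResult(multimodal_agent(url, screenshot), 2)


def sites_per_budget(
    budget: float,
    url_cost: float,         # assumed cost of one URL-only query
    multimodal_cost: float,  # assumed cost of one URL + screenshot query
    escalation_rate: float,  # assumed fraction of sites sent to tier 2
) -> int:
    """Sites processable for a budget, given the expected cost per site."""
    expected_cost = url_cost + escalation_rate * multimodal_cost
    return int(budget / expected_cost)


def throughput_ratio(agentic_sites: int, multimodal_sites: int) -> float:
    """How many times more sites the agentic approach handles per budget."""
    return agentic_sites / multimodal_sites


# Ratios implied by the per-$100 counts reported in the abstract:
#   GPT-4o mini:      throughput_ratio(107_440, 25_626)     is about 4.2
#   Gemini 1.5 Flash: throughput_ratio(2_232_142, 862_068)  is about 2.6
```

Under this simple cost model, the saving depends on how cheap the URL-only query is relative to the multimodal one and on how rarely the first agent is inconclusive. The actual per-query costs and escalation rates are not given in the abstract, so the parameters above are left for the reader to supply.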