FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection

SMS fraud is increasingly cross-channel: a message directs the user to a webpage, and the final risk depends on how the SMS claim aligns with the page content and requested user action. However, existing evaluations either focus on message-only smishing classification or expose URL and domain cues that allow models to rely on reputation shortcuts. To address this gap, we introduce \textbf{FraudSMSWalker}, a controlled benchmark for URL-masked SMS-to-webpage fraud judgment. FraudSMSWalker contains 699 bilingual chains, including 332 fraudulent and 367 benign cases, across ten service scenarios. The model-visible input consists of the SMS context and sanitized webpage evidence, while raw URLs, hosts, domains, IPs, redirects, and reputation metadata are withheld. The benchmark further includes hard benign cases whose pages contain login, payment, verification, or account-management elements that are plausible under the service context but also appear in scam flows. We evaluate nine web agents under masked browser-agent protocols and conduct URL-visibility ablations. The results show that current agents can detect suspicious cues, but struggle to preserve benign recall and often produce positive predictions that are weakly supported by the observed evidence. These findings position FraudSMSWalker as a benchmark for measuring whether web agents can make fraud judgments that remain both accurate and evidence-grounded when direct reputation shortcuts are suppressed. The associated code and dataset are accessible at the \href{https://anonymous.4open.science/w/FraudMessageWalker-Bench}{anonymous link}.
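The URL-masked input described above can be illustrated with a minimal sketch. It assumes a simple regex-based sanitizer. The pattern names, the `[URL]` and `[IP]` placeholders, and the `mask_locators` and `build_model_input` helpers are hypothetical and are not taken from the benchmark's released code, which may sanitize evidence differently.

```python
import re

# Hypothetical patterns: the benchmark's actual sanitization rules are not
# specified in the abstract, so these are illustrative assumptions only.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b(?:/\S*)?", re.IGNORECASE)


def mask_locators(text: str) -> str:
    """Replace URLs, IPv4 addresses, and bare domains with placeholders."""
    text = URL_RE.sub("[URL]", text)    # full URLs first, so paths go with them
    text = IPV4_RE.sub("[IP]", text)    # then raw IPv4 addresses
    text = DOMAIN_RE.sub("[URL]", text)  # finally bare domains such as a.example.org
    return text


def build_model_input(sms_text: str, page_text: str) -> dict:
    """Assemble the model-visible record: masked SMS plus masked page evidence."""
    return {
        "sms": mask_locators(sms_text),
        "page_evidence": mask_locators(page_text),
    }


if __name__ == "__main__":
    record = build_model_input(
        "Your parcel is held. Pay at https://pay.example.com/x?id=1 now",
        "Visit 192.168.0.1 or bank.example.org/login to verify your account",
    )
    print(record)
```

A regex pass like this is deliberately coarse: it can over-mask text where a period is not followed by a space, and it does not address redirects or reputation metadata, which would have to be dropped at collection time rather than rewritten in text.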
