TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication

Modern phishing campaigns increasingly evade snapshot-based URL classifiers using interaction gates (e.g., checkbox or slider challenges), delayed content rendering, and logo-less credential harvesters. This shifts URL triage from static classification toward an interactive forensics task: an analyst must actively navigate the page while staying isolated from potential runtime exploits. We present TraceScope, a decoupled triage pipeline that operationalizes this workflow at scale. To ensure safety and prevent the observer effect, a sandboxed operator agent, guided by visual motivation, drives a real GUI browser to elicit page behavior and then freezes the session into an immutable evidence bundle. Separately, an adjudicator agent works around LLM context limitations by querying the evidence on demand to verify a MITRE ATT&CK checklist, then generates an audit-ready report with extracted indicators of compromise (IOCs) and a final verdict. Evaluated on 708 reachable URLs from existing datasets (241 verified phishing URLs from PhishTank and 467 benign URLs from Tranco-derived crawling), TraceScope achieves 0.94 precision and 0.78 recall, substantially improving recall over three prior visual/reference-based classifiers while producing reproducible, analyst-grade evidence suitable for review. More importantly, we manually curated a dataset of real-world phishing emails to evaluate the system in a practical setting. In this evaluation, TraceScope remains effective, detecting sophisticated phishing attempts that current state-of-the-art defenses fail to identify.
