AI is increasingly being used to assist fraud and cybercrime. However, the extent to which current large language models can provide useful information for complex criminal activity remains unclear. Working with law enforcement and policy experts, we developed multi-turn evaluations for three fraud and cybercrime scenarios: romance scams, CEO impersonation, and identity theft. Our evaluations focus on text-to-text interactions. In each scenario, we evaluate whether models provide actionable assistance beyond information typically available on the web, as assessed by domain experts. We do so in ways designed to resemble real-world misuse, such as breaking down requests for fraud into a sequence of seemingly benign queries. We found that (1) current large language models provide minimal actionable information for fraud and cybercrime unless advanced jailbreaking techniques are used, (2) model safeguards have a significant impact on the information provided, with the two open-weight large language models fine-tuned to remove safety guardrails providing the most actionable and useful responses, and (3) decomposing requests into benign-seeming queries elicited more assistance than explicitly malicious framing or basic system-level jailbreaks. Overall, the results suggest that, absent extensive effort to circumvent safeguards, current text-generation models provide relatively minimal uplift for fraud and cybercrime through information provision. This work contributes a reproducible, expert-grounded framework for tracking how these risks may evolve over time as models grow more capable and adversaries adapt.