Large Language Models (LLMs) have become critical to modern software development, but their reliance on uncurated web-scale datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. To systematically evaluate this risk, we introduce Scam2Prompt, a scalable automated auditing framework that identifies the underlying intent of a scam site and then synthesizes innocuous, developer-style prompts that mirror this intent, allowing us to test whether an LLM will generate malicious code in response to these innocuous prompts. In a large-scale study of four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), we found that Scam2Prompt's innocuous prompts triggered malicious URL generation in 4.24% of cases. To test the persistence of this security risk, we constructed Innoc2Scam-bench, a benchmark of 1,559 innocuous prompts that consistently elicited malicious code from all four initial LLMs. When applied to seven additional production LLMs released in 2025, we found the vulnerability is not only present but severe, with malicious code generation rates ranging from 12.7% to 43.8%. Furthermore, existing safety measures like state-of-the-art guardrails proved insufficient to prevent this behavior, with an overall detection rate of less than 0.3%.
翻译:大语言模型(LLMs)已成为现代软件开发的关键组成部分,但其训练依赖未经筛选的网络规模数据集,引入了重大的安全风险:恶意内容的吸收与再现。为系统评估此风险,我们提出了Scam2Prompt,一种可扩展的自动化审计框架。该框架首先识别诈骗网站的根本意图,随后合成与该意图匹配的无害开发者风格提示词,从而测试大语言模型是否会针对这些无害提示生成恶意代码。通过对四个生产级大语言模型(GPT-4o、GPT-4o-mini、Llama-4-Scout和DeepSeek-V3)的大规模研究,我们发现Scam2Prompt生成的无害提示词在4.24%的案例中触发了恶意URL生成。为验证此安全风险的持续性,我们构建了Innoc2Scam-bench基准测试集,包含1,559个能持续引发所有四个初始大语言模型生成恶意代码的无害提示。将该基准应用于2025年发布的另外七个生产级大语言模型时,我们发现该漏洞不仅普遍存在且极为严重,恶意代码生成率介于12.7%至43.8%之间。此外,现有安全措施(如最先进的防护机制)均未能有效阻止该行为,整体检测率低于0.3%。