When AI Takes the Wheel: Security Analysis of Framework-Constrained Program Generation

AI-assisted software development has spread rapidly in recent years. With the help of Large Language Models (LLMs), even novice developers can now design and generate complex framework-constrained software systems from high-level requirements. However, as LLMs gradually "take the wheel" of software development, developers may check only whether the generated program works and overlook security problems hidden in its implementation. In this work, we investigate the security properties of framework-constrained programs generated by state-of-the-art LLMs. We focus on Chrome extensions because their security model is complex, involving multiple privilege boundaries and isolated components. To this end, we built ChromeSecBench, a dataset of 140 prompts based on known vulnerable extensions. We used these prompts to instruct nine state-of-the-art LLMs to generate complete Chrome extensions, and then analyzed the generated extensions for vulnerabilities along three dimensions: scenario type, model, and vulnerability category. Our results show that LLMs produced vulnerable programs at alarmingly high rates (18%-50%), particularly in Authentication & Identity and Cookie Management scenarios (up to 83% and 78%, respectively). Most vulnerabilities exposed sensitive browser data such as cookies, history, or bookmarks to untrusted code. Notably, advanced reasoning models performed worse, generating more vulnerabilities than simpler models. These findings highlight a critical gap between LLMs' coding skills and their ability to write secure framework-constrained programs.

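The abstract reports that most vulnerabilities exposed sensitive browser data such as cookies to untrusted code. As a rough illustration of what crossing a privilege boundary unsafely can look like, the following is a minimal sketch, not code from the paper or from ChromeSecBench. It models a privileged component that answers cookie requests, loosely inspired by extension message passing. All names here (`SenderInfo`, `CookieRequest`, `CookieReader`, `handleMessageUnsafe`, `handleMessageChecked`) are hypothetical, and the cookie store is a stand-in function rather than a real browser API.

```typescript
// Hypothetical sketch: every name below is an assumption made for illustration.

// Simplified stand-in for the information a privileged component
// knows about whoever sent it a message.
interface SenderInfo {
  origin?: string; // origin of the page that sent the message, if known
}

interface CookieRequest {
  type: "getCookies";
  domain: string;
}

// Stand-in for a privileged cookie lookup (not a real browser API).
type CookieReader = (domain: string) => string[];

// Vulnerable pattern: the handler returns sensitive data to any caller
// without checking where the request came from.
function handleMessageUnsafe(
  req: CookieRequest,
  _sender: SenderInfo,
  readCookies: CookieReader
): string[] {
  return readCookies(req.domain);
}

// More defensive pattern: the handler answers only senders whose origin
// appears on an explicit allowlist, and refuses everyone else.
function handleMessageChecked(
  req: CookieRequest,
  sender: SenderInfo,
  readCookies: CookieReader,
  allowedOrigins: readonly string[]
): string[] | null {
  const origin = sender.origin;
  if (origin === undefined || allowedOrigins.indexOf(origin) === -1) {
    return null; // untrusted or unknown sender: expose nothing
  }
  return readCookies(req.domain);
}
```

In this sketch the two handlers differ only in the origin check. That small gap is the kind of detail a developer who only verifies that the program "works" would be unlikely to notice, since both versions behave identically for legitimate callers.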