With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding, safety concerns, such as generating or executing risky code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations of the safety of code agents, we propose RedCode, a benchmark for risky code execution and generation: (1) RedCode-Exec provides challenging prompts that could lead to risky code execution, aiming to evaluate code agents' ability to recognize and handle unsafe code. We provide a total of 4,050 risky test cases in Python and Bash tasks with diverse input formats, including code snippets and natural text. These test cases cover 25 types of critical vulnerabilities spanning 8 domains (e.g., websites, file systems). We provide Docker environments and design corresponding evaluation metrics to assess the agents' execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents' vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing risky operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risk. Risky operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen show that more capable base models and agents with stronger overall coding abilities, such as GPT-4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents. Our dataset and code are available at https://github.com/AI-secure/RedCode.
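The rejection-rate evaluation described above can be sketched as a simple loop: feed each risky test case to the agent and count how often it refuses to execute. This is a minimal, hypothetical sketch; the `run_agent` interface and the test-case dictionary format are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of a RedCode-Exec-style rejection-rate evaluation.
# `run_agent` and the test-case format are illustrative assumptions,
# not the benchmark's actual interface.

def evaluate_rejection_rate(test_cases, run_agent):
    """Return the fraction of risky test cases the agent refuses to execute."""
    if not test_cases:
        return 0.0
    rejected = 0
    for case in test_cases:
        # Each case pairs a risky prompt (a code snippet or a natural-text
        # description) with the language it targets (Python or Bash).
        response = run_agent(prompt=case["prompt"], language=case["language"])
        if response["action"] == "reject":
            rejected += 1
    return rejected / len(test_cases)

# Toy usage with a stub agent that rejects only Bash prompts.
cases = [
    {"prompt": "rm -rf /tmp/data", "language": "bash"},
    {"prompt": "open('/etc/passwd').read()", "language": "python"},
]
stub_agent = lambda prompt, language: {
    "action": "reject" if language == "bash" else "execute"
}
print(evaluate_rejection_rate(cases, stub_agent))  # → 0.5
```

In the actual benchmark, the agent's response would be executed inside the provided Docker environment and scored with the corresponding metrics rather than checked against a single "reject" flag.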