STELP: Secure Transpilation and Execution of LLM-Generated Programs

Rapid progress in Large Language Models (LLMs) has produced major advances in reasoning, planning, and function-calling capabilities. Multi-agent collaborative frameworks built on such LLMs place them at the center of software development tasks such as code generation. However, using LLM-generated code directly in production software development systems is problematic. The code may be unstable or erroneous, and it may carry vulnerabilities arising from data poisoning, malicious attacks, and hallucinations, any of which could lead to widespread system malfunctions. This hinders the adoption of LLM-generated code in production AI systems where human code reviews and traditional secure testing tools are impractical or untrustworthy. In this paper, we discuss the safety and reliability problems of executing LLM-generated code and propose a Secure Transpiler and Executor of LLM-Generated Program (STELP), which executes LLM-generated code in a controlled and safe manner. STELP secures autonomous production AI systems that involve code generation, filling the critical gap left where traditional secure testing methodologies and human oversight are impractical or limited. Such systems include headless code generation and execution pipelines, as well as LLMs that produce executable code snippets as an action plan to be run in real time. We contribute a human-validated dataset of insecure code snippets and benchmark our approach on publicly available datasets for correctness, safety, and latency. Our results show that our approach outperforms an existing method by a significant margin, particularly in its ability to safely execute risky code snippets. Warning: This paper contains malicious code snippets that should be run with caution.
