Recent advances in large language models (LLMs) have substantially enhanced automated code generation across a wide range of programming languages. Nonetheless, verifying the correctness and executability of LLM-generated code remains a significant challenge, as traditional methods rely on language-specific compilers and environment-dependent runtimes. To overcome these limitations, we introduce StackPilot, an LLM-native, multi-agent framework designed for language-agnostic code verification and execution that operates independently of conventional toolchains. StackPilot offers three principal innovations: (1) a Function-as-Agents paradigm, in which each function is modeled as an autonomous agent capable of fine-grained reasoning and collaborative verification; (2) an LLM-as-Executor strategy, which enables scalable verification via stack-based scheduling; and (3) a novel snapshot mechanism that preserves complete execution contexts, facilitating deterministic and lossless context switching during verification. Empirical evaluations demonstrate that StackPilot achieves framework reliability rates between 89% and 97%, substantially outperforming baseline approaches. These results indicate that StackPilot can reliably verify and execute a significantly larger proportion of LLM-generated code across diverse programming tasks than existing methods.
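The interplay of stack-based scheduling and context snapshots described above can be sketched in miniature. This is a hypothetical illustration, not StackPilot's implementation: the names `Snapshot`, `StackScheduler`, and the toy function-agents below are our own, and a plain Python callable stands in for the LLM executor. The sketch shows only the core invariant the abstract claims: each call pushes a frozen copy of the caller's context, and that context is restored losslessly and deterministically on return.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Snapshot:
    """Immutable copy of a frame's local context, restored on return."""
    function: str
    locals: tuple  # sorted (name, value) pairs, frozen for determinism

class StackScheduler:
    """Toy stack-based scheduler: each call pushes a frame whose full
    context is snapshotted before control transfers to the callee."""

    def __init__(self):
        self.stack: list[Snapshot] = []
        self.trace: list[str] = []

    def call(self, caller: str, caller_locals: dict, callee: Callable, *args):
        # Snapshot the caller's complete context before switching.
        self.stack.append(Snapshot(caller, tuple(sorted(caller_locals.items()))))
        self.trace.append(f"push {caller}")
        result = callee(self, *args)          # callee runs with its own frame
        restored = self.stack.pop()           # lossless context switch back
        self.trace.append(f"pop {restored.function}")
        assert restored.function == caller
        return result, dict(restored.locals)

# Two toy function-agents: "main" delegates to "square" via the scheduler.
def square(sched: StackScheduler, x: int) -> int:
    return x * x

def main_agent(sched: StackScheduler) -> int:
    n = 3
    result, ctx = sched.call("main", {"n": n}, square, n)
    return result + ctx["n"]  # restored context is intact after the call

sched = StackScheduler()
print(main_agent(sched))  # → 12
```

In a full system, the scheduler would replay each snapshot into the LLM executor's prompt when a suspended frame resumes; here the restored dictionary plays that role, demonstrating that no caller state is lost across the context switch.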